Abstract.

Huber's criterion is a useful method for robust regression. The adaptive least absolute
shrinkage and selection operator (lasso) is a popular technique for simultaneous estimation and
variable selection. When the sample size is small and the number of covariables is large, this penalty
is not a very satisfactory variable selection method. In this paper, we introduce an adaptive reversed
version of Huber's criterion as a penalty function. We call this penalty the adaptive Berhu penalty. As for
the elastic net penalty, small coefficients contribute their $\ell_1$ norm to this penalty while larger
coefficients cause it to grow quadratically (as in ridge regression). We show that the estimator associated
with a criterion such as ordinary least squares or Huber's one, combined with the adaptive Berhu penalty,
enjoys the oracle properties. In addition, this procedure encourages a grouping effect. This approach
is compared with adaptive elastic net regularization. Extensive simulation studies demonstrate
satisfactory finite-sample performance of this procedure. A real example is analyzed for illustration
purposes.

Keywords. Adaptive Berhu penalty; concomitant scale; elastic net penalty; Huber's criterion; 
oracle property; robust estimation. 

Availability. The software that implements the procedures on which this paper focuses is
developed in Matlab. It is available at http://ljk.imag.fr/membres/Laurent.Zwald.

1 Introduction 

Data subject to heavy-tailed errors or outliers are commonly encountered in applications; the outliers
may appear either in the response variables or in the predictors. We consider here the regression problem
with responses possibly subject to heavy-tailed errors or outliers. In this case, the Ordinary
Least Squares (OLS) estimator is known to be inefficient. To overcome this problem, the least
absolute deviation (LAD) or Huber-type estimators, for instance, can be useful. On the other hand, an
important topic in linear regression analysis is variable selection. Variable selection is particularly
important when the true underlying model has a sparse representation. To enhance the prediction
performance of the fitted model and to make it easy to interpret, we need to identify the
significant predictors. Scientists prefer a simpler model because it sheds more light on the relationship
between the response and the covariates. We consider the important problem of robust model selection.

The lasso penalty is a regularization technique for simultaneous estimation and variable selection
([25]). It consists in introducing an $\ell_1$ penalty. This penalty forces some coefficients to shrink. In [5],
the authors show that, since the lasso uses the same tuning parameter for all the regression coefficients,
the resulting estimators may suffer an appreciable bias. Moreover, when the sample size n is small
and the number of covariables p is larger, the lasso selects at most n variables. Recently, [18], [16], [E3] and
[M] have shown that the underlying model must satisfy a nontrivial condition for the lasso estimator to be
consistent in variable selection. Consequently, in some cases, the lasso estimator cannot be consistent in
variable selection. To address this issue, [31] assigns adaptive weights in order to penalize the coefficients
differently in the $\ell_1$ penalty and calls this new penalty the adaptive lasso. These adaptive weights in the
penalty yield the oracle properties. Moreover, the adaptive lasso can be solved by the same
efficient algorithm (LARS) as the lasso (see [34]). Notice that recently (see [15]) this penalty
has been combined with Huber's criterion. The estimator associated with this procedure enjoys
the oracle properties.

On the other hand, if there is a group of variables among which the pairwise correlations are very
high, then the lasso penalty tends to select only one variable from this group. Ridge regression
($\ell_2$ penalty) does not perform variable selection but tends instead to share the coefficient values among
the group of correlated predictors. Moreover, if there exist high correlations among the predictors, the
prediction performance of ridge regression dominates that of the lasso [25]. In order to overcome this
drawback of the lasso, [35] proposes a new regularization technique that combines the lasso and the
ridge penalties. They call their method "elastic net" (en). The en penalty is the sum of the lasso
and the ridge penalties. However, even in the usual case, it is not deemed to be an oracle procedure.
In [6], the author proposes a new version of the elastic net called adaptive elastic net (adaptive en),
which inherits some of the desirable properties of the adaptive lasso and of the elastic net. He proves its
oracle properties. In [19], the author proposes to use a reversed version of Huber's criterion (called
Berhu) as a penalty function. Let us recall that the Huber criterion (see [12]) is a hybrid of squared
error for relatively small errors and absolute error for relatively large ones. The Berhu penalty is such
that relatively small coefficients contribute their $\ell_1$ norm to this penalty while larger ones cause
it to grow quadratically. This hybrid sets some coefficients to 0, as the lasso does, while shrinking
the larger coefficients in the same way as ridge regression. In [19], the author provides a way
to optimize an objective function made of both the Huber criterion and the Berhu
penalty in a non-adaptive form. Nevertheless, nothing is shown about its asymptotic behaviour.

In this paper we introduce an adaptive Berhu penalty with concomitant. We use it with the
ordinary least squares criterion or Huber's one in order to take into account data subject to
heavy-tailed errors or outliers. We show that the estimator associated with such procedures enjoys
the oracle properties (in the standard case of the least squares criterion and in the case of Huber's
one). In addition, this procedure encourages a grouping effect in the following way. The spirit of the
Berhu penalty with concomitant is to implicitly create one group with the largest coefficients. This
group is penalized in an $\ell_2$ way, like the grouped lasso of [31], to avoid removing any of these largest
coefficients. The smallest coefficients are treated individually by an $\ell_1$-penalty. The en procedure
relies on the fact that, in order to have a grouping effect, we want to keep or delete highly
correlated variables together. We show that, when combined with the ordinary least squares criterion, the Berhu
penalty leads to this "grouping effect property".

The rest of the article is organized as follows. In Section 2, we introduce the adaptive Berhu
penalty and show that it induces a grouping effect. In Section 3, we give its statistical properties.
Section 4 is devoted to simulations and to an illustration on real data. This study compares the least
squares criterion and Huber's criterion with various penalties such as adaptive lasso, ridge, en
and adaptive Berhu. All technical proofs are relegated to the Appendix.

2 The Berhu penalty 

2.1 The adaptive Berhu 

Let us consider the linear regression model

$$y_i = \alpha^* + x_i^T\beta^* + \sigma\epsilon_i, \quad i = 1,\dots,n, \qquad (1)$$

where $x_i = (x_{i1},\dots,x_{ip})^T$ is the p-dimensional centered covariable (that is, $\sum_{i=1}^n x_i = 0$), $\alpha^*$ is the
constant parameter and $\beta^* = (\beta_1^*,\dots,\beta_p^*)^T$ are the associated regression coefficients. We suppose
that $\sigma > 0$ and that the $\epsilon_i$ are independent and identically distributed random errors with mean 0 and
variance 1, when it exists. Indeed, in the sequel we do not need the existence of the variance. Let
$A = \{1 \le j \le p,\ \beta_j^* \neq 0\}$ and $p_0 = |A|$. In the variable selection context, we usually assume that
$\beta_j^* \neq 0$ for $j \le p_0$ and $\beta_j^* = 0$ for $j > p_0$, for some $p_0 > 0$. In this case the correct model has $p_0$
significant regression variables. We denote by $\beta_A$ the vector given by the coordinates of $\beta$ whose indices
are in A.

When $p_0 = p$, the unknown parameters in the model are usually estimated by minimizing the
ordinary least squares criterion. To shrink unnecessary coefficients to 0, [25] proposed to introduce
a constraint on the $\ell_1$-norm of the coefficients:

$$\sum_{i=1}^n (y_i - \alpha - x_i^T\beta)^2 + \lambda_n \sum_{j=1}^p |\beta_j|,$$

where $\lambda_n > 0$ is the tuning parameter. Notice that the intercept $\alpha$ does not appear in the penalty
term since it is not reasonable to constrain it.

Many criticisms have already been levelled at the lasso (see e.g. [22]). In this paper, we focus
on the fact that when some variables are highly correlated, the $\ell_1$ penalty tends to keep only one
variable for each group. The literature already contains attempts to solve this problem. To begin
with, grouped lasso procedures were first proposed in [31], where the $\ell_1$ penalty is imposed on
predefined groups of coefficients. More precisely, the penalty is the $\ell_1$-norm of the vector composed
of the $\ell_2$-norm of each group of coefficients:

$$\sum_{i=1}^n (y_i - \alpha - x_i^T\beta)^2 + \lambda_n \sum_{j=1}^L \|(\beta)_j\|_2,$$

where $(\beta)_j$ is the block of coordinates corresponding to the j-th group. Consequently, sparsity is
encouraged at the group level (see also [32] and [9] page 91 for further references). In our framework
it is difficult to use the group lasso approach since there is no obvious way of choosing the groups
a priori. Next, [35] proposed the Elastic Net. The naive Elastic Net is obtained by minimizing

$$\sum_{i=1}^n (y_i - \alpha - x_i^T\beta)^2 + \lambda_{1,n} \sum_{j=1}^p |\beta_j| + \lambda_{2,n} \sum_{j=1}^p \beta_j^2, \qquad (2)$$

and the Elastic Net is a modification of this. In this procedure, the penalty imposed on the small
coefficients is the sum of an $\ell_1$-norm and a squared $\ell_2$-norm. Moreover, the ridge penalty reduces
the variance of the estimates by imposing a small squared norm on all the coefficients. However,
it suffices to constrain the largest coefficients to be small to get this reduction of variance: by
definition, the smallest ones do not need to be constrained to be small. Consequently, we consider a
penalty which is quadratic only on the largest coefficients. Following [19], we focus on a penalty
that acts separately on small and large coefficients. We consider the Berhu penalty defined by



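$$B_L(z) = \begin{cases} |z| & \text{if } |z| \le L, \\ \dfrac{z^2 + L^2}{2L} & \text{if } |z| > L, \end{cases} \qquad (3)$$

where L is any positive real. As for Huber's criterion, the Berhu function needs to be scaled. Precisely,
the penalty can be defined by

$$\sum_{j=1}^p B_L\left(\frac{\beta_j}{\tau}\right),$$

where $\tau$ is a scale parameter to be determined. To do that we can, as in [19], replace the penalty
term by

$$\mathrm{pen}(\beta) = \min_{\tau > 0} \left( p\tau + \tau \sum_{j=1}^p B_L\left(\frac{\beta_j}{\tau}\right) \right).$$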
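To make the definition concrete, here is a minimal Python sketch of the Berhu function (3) and of the scaled objective that is minimized over the scale parameter. The names `berhu` and `scaled_berhu_objective`, as well as the default value L = 1.345, are choices made for this illustration only; they are not taken from the Matlab software mentioned above.

```python
def berhu(z, L=1.345):
    """Berhu function B_L: |z| for |z| <= L, (z^2 + L^2) / (2L) beyond L."""
    a = abs(z)
    if a <= L:
        return a
    return (z * z + L * L) / (2.0 * L)


def scaled_berhu_objective(beta, tau, L=1.345):
    """Value of p*tau + tau * sum_j B_L(beta_j / tau) for a given tau > 0.

    This is the quantity whose minimum over tau > 0 defines pen(beta).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    p = len(beta)
    return p * tau + tau * sum(berhu(b / tau, L) for b in beta)
```

The two branches of `berhu` agree at |z| = L, so the function is continuous there; small arguments are penalized linearly and large ones quadratically.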



Fan and Li [5] showed that the lasso method leads to estimators that may suffer an appreciable
bias. Furthermore, they conjectured that the oracle properties do not hold for the lasso. Hence Zou
[34] proposes to consider the following modified lasso criterion, called adaptive lasso,

$$\sum_{i=1}^n (y_i - \alpha - x_i^T\beta)^2 + \lambda_n \sum_{j=1}^p w_j^{adl} |\beta_j|,$$

where $w^{adl} = (w_1^{adl},\dots,w_p^{adl})$ is a known weights vector. This modification produces sparse
solutions more effectively than the lasso. Precisely, Zou [31] shows that, with a proper choice of $\lambda_n$ and
of $w^{adl}$, the adaptive lasso enjoys the oracle properties. Such a penalty has been used in the en
penalty (see [6]).

Here we propose to make the Berhu penalty adaptive. That is, we consider the following penalty
$\min_{\tau \in \mathbb{R}_+} P^{adb}(\beta,\tau)$ with

( r(EU4- + ^U^(^)) *f ->°> 
P a (P,t) = < o if = 0, t = 0, 

[ +oo if f3=£0, t = 0. 

where $w^{adb} = (w_1^{adb},\dots,w_p^{adb})$ is a known weights vector. We will see in Section 3 that the resulting
estimator enjoys the oracle properties. Let us notice that [19] introduced the Berhu penalty in its
non-adaptive form and in the context of robust regression only. Moreover, nothing is shown there about
its asymptotic behaviour.
In the general case, the (adaptive) Berhu penalty behaves like the lasso on the smallest coefficients
and does not delete the largest ones, whatever the correlation structure. This is what we expect
from a good model selection procedure. This interpretation relies on the following calculation when $\beta$
is fixed. For instance, in the non-adaptive case, let us sort the absolute values of the coordinates of
$\beta$:

$$|\beta_{(p)}| \le \dots \le |\beta_{(1)}|.$$

Let $k(\beta)$ denote the number of non-zero coefficients of $\beta$. Then the minimum defined in $\mathrm{pen}(\beta)$ is
achieved at

$$\hat\tau(\beta) = \sqrt{\frac{\sum_{j=1}^{q(\beta)-1} \beta_{(j)}^2}{2Lp + L^2(q(\beta)-1)}}$$

if $\beta \neq 0$, where $q(\beta)$ is the unique integer between 2 and $k(\beta)+1$ such that
$|\beta_{(q(\beta))}|/L \le \hat\tau(\beta) < |\beta_{(q(\beta)-1)}|/L$. Consequently,

$$\mathrm{pen}(\beta) = \sqrt{\frac{2p}{L} + q(\beta) - 1}\ \sqrt{\sum_{j=1}^{q(\beta)-1} \beta_{(j)}^2} + \sum_{j=q(\beta)}^{k(\beta)} |\beta_{(j)}|. \qquad (4)$$

The en procedure (or its variant Elastic Corr-Net [3]) relies (explicitly for Elastic Corr-Net) on the
fact that, in order to have a grouping effect, we want to keep or delete highly correlated
variables together. We will see in Section 2.4 that this is the case for the Berhu procedure. But we can note here
the different spirit of the Berhu penalty with concomitant: it implicitly creates one group with the
largest coefficients (see (4)). This group is penalized in an $\ell_2$ way, like the grouped lasso of [31], to
avoid removing any of these largest coefficients. Let us note that, as in the grouped lasso penalty,
the $\ell_2$-norm of the $q(\beta) - 1$ largest coefficients is scaled by the square root of the number of such
coefficients present in this group. The smallest coefficients are treated individually by an $\ell_1$-penalty
(see (4)). Consequently, whatever the structure of the correlation matrix, the Berhu penalty with
concomitant tends to keep all the largest coefficients and to delete the smallest ones.
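The split between "large" and "small" coefficients can be checked numerically. The following Python sketch is written under the assumption that the non-adaptive penalty is the minimum over $\tau > 0$ of $p\tau + \tau\sum_j B_L(\beta_j/\tau)$; the closed form it uses is obtained by setting the derivative in $\tau$ to zero on the set of coefficients exceeding $L\tau$. The function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
import math


def berhu(z, L):
    """Berhu function B_L."""
    a = abs(z)
    return a if a <= L else (z * z + L * L) / (2.0 * L)


def objective(beta, tau, L):
    """p*tau + tau * sum_j B_L(beta_j / tau), to be minimized over tau > 0."""
    return len(beta) * tau + tau * sum(berhu(b / tau, L) for b in beta)


def concomitant_berhu(beta, L=1.345):
    """Return (tau_hat, m, pen) where m is the number of quadratically
    penalized (largest) coefficients and pen is the penalty value."""
    p = len(beta)
    b = sorted((abs(x) for x in beta), reverse=True)  # |b_(1)| >= ... >= |b_(p)|
    if b[0] == 0.0:
        return 0.0, 0, 0.0  # beta = 0: the objective tends to 0 as tau -> 0
    sq = 0.0
    for m in range(1, p + 1):  # candidate number of large coefficients
        sq += b[m - 1] ** 2
        tau = math.sqrt(sq / (2.0 * L * p + L * L * m))
        nxt = b[m] if m < p else 0.0
        # the split is consistent when exactly the m largest exceed L * tau
        if nxt / L <= tau < b[m - 1] / L:
            pen = math.sqrt(2.0 * p / L + m) * math.sqrt(sq) + sum(b[m:])
            return tau, m, pen
    raise RuntimeError("no consistent split found (numerical ties)")
```

Since the objective is convex in $\tau$, the value returned by `concomitant_berhu` can be compared with any one-dimensional numerical minimization of `objective`, which gives a simple sanity check of the closed form.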



2.2 Robust estimation 

To be robust to heavy-tailed errors or outliers in the response, a possibility is to use Huber's
criterion as loss function, as introduced in [12]. For any positive real M, let us introduce the following
function

$$\mathcal{H}_M(z) = \begin{cases} z^2 & \text{if } |z| \le M, \\ 2M|z| - M^2 & \text{if } |z| > M. \end{cases}$$

This function is quadratic for small values of z but grows linearly for large values of z. The parameter
M describes where the transition from quadratic to linear takes place. Huber's criterion with
concomitant scale is defined by

$$\mathcal{L}_H(\alpha,\beta,s) = \begin{cases} ns + \sum_{i=1}^n \mathcal{H}_M\left(\dfrac{y_i - \alpha - x_i^T\beta}{s}\right) s & \text{if } s > 0, \\ 2M \sum_{i=1}^n |y_i - \alpha - x_i^T\beta| & \text{if } s = 0, \\ +\infty & \text{if } s < 0, \end{cases}$$

which is to be minimized with respect to $s \ge 0$, $\alpha$ and $\beta$. To get a robust scale-invariant lasso-type
procedure, [15] proposes to minimize simultaneously over s, $\alpha$ and $\beta$ the function

$$\mathcal{L}_H(\alpha,\beta,s) + \lambda_n \sum_{j=1}^p w_j^{adl} |\beta_j|, \qquad (5)$$

where $w^{adl} = (w_1^{adl},\dots,w_p^{adl})$ is a known weights vector. The loss function involving a concomitant
estimation of the scale and location parameters was first proposed by Huber ([12]). We propose here
to use the concomitant estimation of Huber with the Berhu penalty:

$$Q^{Hadb}(\alpha,\beta,s,\tau) = \mathcal{L}_H(\alpha,\beta,s) + \lambda_n P^{adb}(\beta,\tau). \qquad (6)$$

This criterion is minimized simultaneously over $\alpha \in \mathbb{R}$, $\beta \in \mathbb{R}^p$, $s \in \mathbb{R}_+$ and $\tau \in \mathbb{R}_+$. So we get
another scale-invariant robust location estimation. Contrary to the procedure proposed in [15], the
largest coordinates of $\beta$ are quadratically penalized.
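As an illustration of the loss part of the criterion, the sketch below evaluates Huber's function and Huber's criterion with concomitant scale on a vector of residuals $y_i - \alpha - x_i^T\beta$. It assumes the piecewise form recalled above (value $2M\sum_i|r_i|$ at s = 0 and $+\infty$ for s < 0); the function names are chosen for this example only.

```python
import math


def huber(z, M=1.345):
    """Huber function: z^2 for |z| <= M, 2M|z| - M^2 beyond M."""
    a = abs(z)
    return z * z if a <= M else 2.0 * M * a - M * M


def huber_concomitant_loss(residuals, s, M=1.345):
    """Huber's criterion with concomitant scale s, evaluated on residuals."""
    n = len(residuals)
    if s < 0:
        return math.inf
    if s == 0:
        return 2.0 * M * sum(abs(r) for r in residuals)
    return n * s + s * sum(huber(r / s, M) for r in residuals)
```

Multiplying both the residuals and the scale by a constant c > 0 multiplies the loss by c, which is the scale-invariance property exploited by the concomitant formulation.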



2.3 Tuning parameter estimation 

Let us now consider the problem of tuning parameter estimation. To run these procedures we
have to determine the weights vector in the adaptive penalties, the regularization constant $\lambda_n$, the
parameter M for Huber's criterion and L for the Berhu penalty. Usually the weights vector is given
by (see [3U], [15]) $w_j^{adl} = |\hat\beta_j^{unpen}|^{-\gamma}$, $j = 1,\dots,p$, where $\gamma > 0$ and $\hat\beta^{unpen}$ denotes the unpenalized
estimator. For instance, in the least squares context $\hat\beta^{unpen}$ is the ordinary least squares estimator.
In fact this estimator only needs to be a root-n-consistent estimator of $\beta^*$. Let us note that the theoretical
part is given for this form of weights vector and that $\gamma$ is fixed to 1 for the numerical results.
For Huber's criterion with concomitant scale we need a value for M. As in [12], we fix M = 1.345.
For the Berhu penalty we fix, as in [19], L = M. Let us note that we do not have any justification
for doing so. However, in practice we have observed that these parameters have little impact on the
results.
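The weights can be computed directly from any initial estimate. The short sketch below assumes the form $w_j = |\hat\beta_j^{unpen}|^{-\gamma}$; the function name `adaptive_weights` and the convention of returning an infinite weight for a zero initial coefficient are assumptions made for this illustration.

```python
def adaptive_weights(beta_init, gamma=1.0):
    """Adaptive weights w_j = |beta_init_j|^(-gamma).

    A coefficient that is exactly zero in the initial estimate receives an
    infinite weight, i.e. it is excluded by the penalty (illustrative choice).
    """
    return [abs(b) ** (-gamma) if b != 0 else float("inf") for b in beta_init]
```

With $\gamma = 1$, a large initial coefficient receives a small weight and is therefore penalized less than a small one.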

To find optimal values for $\lambda_n$, we use BIC-type criteria. When using the least squares criterion, we
consider the classical BIC criterion ([22]). That is, it is recommended to select $\lambda_n$ by minimizing

$$\log\left( \sum_{i=1}^n (y_i - \hat\alpha_{\lambda_n} - x_i^T\hat\beta_{\lambda_n})^2 \right) + k_{\lambda_n} \frac{\log(n)}{n}$$

over $\lambda_n$, where $k_{\lambda_n}$ denotes the model dimension. Following [28] and [5U], we determine $k_{\lambda_n}$ by the
number of non-zero coefficients of the estimator. When using Huber's criterion, we consider the
BIC-type procedure introduced in [15]: we select $\lambda_n$ by minimizing

$$\log\left( \mathcal{L}_H(\hat\alpha_{\lambda_n}, \hat\beta_{\lambda_n}, \hat s_{\lambda_n}) \right) + k_{\lambda_n} \frac{\log(n)}{2n}$$

over $\lambda_n$. As previously, $k_{\lambda_n}$ denotes the number of non-zero coefficients of $\hat\beta_{\lambda_n}$.
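The selection rule can be sketched as follows, assuming the two criteria have the form log(fit) + k log(n)/n for least squares and log(loss) + k log(n)/(2n) for Huber's criterion. The helper names and the representation of a grid as a list of (lambda, fit value, number of non-zero coefficients) triples are hypothetical; fitting the estimators themselves is outside the scope of this sketch.

```python
import math


def bic_least_squares(rss, k, n):
    """BIC-type value for the least squares criterion (rss = residual sum of squares)."""
    return math.log(rss) + k * math.log(n) / n


def bic_huber(loss_value, k, n):
    """BIC-type value for Huber's criterion with concomitant scale."""
    return math.log(loss_value) + k * math.log(n) / (2.0 * n)


def select_lambda(candidates, n, criterion=bic_least_squares):
    """candidates: iterable of (lambda, fit_value, k) triples, one per grid point.

    Returns the lambda whose fitted model minimizes the chosen criterion.
    """
    return min(candidates, key=lambda c: criterion(c[1], c[2], n))[0]
```

A typical use is to fit the penalized estimator on a grid of values of $\lambda_n$, record the fit value and the number of non-zero coefficients for each, and pass the resulting triples to `select_lambda`.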
2.4 The Berhu penalty with concomitant induces a grouping effect

An algorithm is said to satisfy the grouping effect property if highly correlated variables lead to similar
estimates of the corresponding coefficients. Such a property was a motivation for introducing
Ridge Regression ([11]). Indeed, the normal equations associated with Ordinary Least Squares do
not imply any stability of the coefficients associated with highly correlated variables. Now, when a
squared $\ell_2$-norm penalty is added, the corresponding normal equations imply a stability of the coefficients
associated with highly correlated variables. Such a reasoning leads to a bound quantifying the grouping
effect of the Elastic Net ([33]). Such a property was generalized to the adaptive Elastic Net in [6]
and also proved for the algorithm of [3].

The goal of the following theorem is to provide a quantitative description of the grouping effect
of the Berhu penalty with concomitant.

Theorem 1. Let $\gamma \ge 0$ and let $(\hat\alpha^{adb}, \hat\beta^{adb}, \hat\tau^{adb})$ be a minimizer of

$$\sum_{i=1}^n (y_i - \alpha - x_i^T\beta)^2 + \lambda_n P^{adb}(\beta,\tau)$$

over $\alpha \in \mathbb{R}$, $\beta \in \mathbb{R}^p$ and $\tau \in \mathbb{R}_+$. We suppose that $\lambda_n > 0$, $\hat\beta_i^{adb} \neq 0$, $\hat\beta_j^{adb} \neq 0$. In this situation, the
following bound holds:

2Lt, 



\(5f b wf b - pfwf\ < — iMhVINIi + INI! - ■!(;. ,.,■;>.,- (7) 

wheve Ljij — mm 1 1, ^ a dbi ^adbi ^f ad6 ) 2 

Obtaining this result for Huber's loss is a difficult task; it is an open question that is left for
future work. Let us remark that when the variables are standardized in $\ell_2$-norm, this leads to



21 r I 

\Pf b wf b -(3f b wf b \ < — \\y\\ 2 ^ 2(1- C hjXi '.r, 



With $\gamma = 0$, we exactly get the grouping effect property in the non-adaptive case. Let now $\gamma \in \mathbb{R}_+^*$.
The upper bound of equation (8) is a decreasing function of the correlation $x_i^T x_j$ between variables i
and j (since $C_{i,j} > 0$). To ensure that the coefficients $\hat\beta_i^{adb}$ and $\hat\beta_j^{adb}$ become similar if the correlation
increases, from (7), the initial estimator $\hat\beta^{unpen}$ used in the weights $w_j^{adb}$ has to satisfy the grouping
effect property. Consequently, this bound effectively provides a quantitative description of the
grouping effect of the Berhu penalty with concomitant if, for example, the initial estimator is
obtained with a ridge penalty.

As compared with the Elastic Net bounds provided by [33] and [6], we do not have to suppose that
$\hat\beta_i^{adb}$ and $\hat\beta_j^{adb}$ have the same sign. Moreover, in our case, the grouping effect occurs more accurately
for large coefficients (see Section 2.1), which is the natural situation where it has to happen. For the
adaptive elastic net, [6] also has to suppose that the initial estimator satisfies the grouping effect
property. Moreover, [36] recommends choosing a non-adaptive elastic net estimator as the initial
estimator in the weights of the adaptive elastic net.
In the simulation study below, the initial estimator used for the weights of all adaptive methods is
the corresponding unpenalized estimator. This choice avoids choosing a supplementary parameter
(e.g. the regularization parameter of ridge regression) and also avoids numerical problems due to
too small coefficients of the initial estimator. This unpenalized estimator does not satisfy the
grouping effect property, but the comparisons between the various methods remain fair. Moreover, in the
simulation studies involving the Berhu penalty with concomitant, the variables were not normalized
in $\ell_2$-norm. Indeed, given the way we build the design matrix X, explicit calculations when the variables
are normalized or not lead to the same order for the corresponding upper bounds.

3 Oracle Properties 

In this section we give the asymptotic properties of the concomitant estimator of Huber with the
Berhu penalty. We show that it enjoys the oracle properties. The same property holds when
Huber's loss is replaced by the least squares one. When necessary, we point out the differences (for example
in the assumptions) between the two loss functions.

Let X denote the design matrix, i.e. the $n \times p$ matrix whose i-th row is $x_i^T$. We will use
some of the following assumptions on this design matrix.

(D1) $\max_{1 \le i \le n} \|x_i\|/\sqrt{n} \to 0$ as $n \to \infty$.

(D2) $X^T X/n \to V$ as $n \to \infty$ with $V_{1,1} > 0$, where $V_{1,1}$ is the first $p_0 \times p_0$ block of V, corresponding
to the covariables associated with the non-zero coefficients.

Assumptions (D1) and (D2) are classical. Assumption (D1) can be seen as a "compactness assumption": it is
satisfied if the variables are supposed to be bounded. When considering the least squares criterion as
loss function, we need only assumption (D2) (see for example [M]), while when considering Huber's
criterion we need both (D1) and (D2) (see [15]).

Let us denote by $\epsilon$ a variable with the same law as $\epsilon_i$, $i = 1,\dots,n$. As in [15], we define

$$s^* = \operatorname*{argmin}_{s > 0} F(s).$$






The following assumptions on the errors are used in the sequel:

(N0) The distribution of the errors does not charge the points $\pm Ms^*$: $P[\sigma\epsilon = \pm Ms^*] = 0$.

(N1) The variable $\epsilon$ is symmetric (i.e. $\epsilon$ has the same distribution as $-\epsilon$).

(N2) For all $a > 0$, $P[\epsilon \in [-a, a]] > 0$.

Note that (N0) holds if $\epsilon$ is absolutely continuous with respect to the Lebesgue measure and
(N2) is satisfied if, moreover, the density is continuous and strictly positive at the origin (which
is assumption A of [29]). Condition (N1) is natural without prior knowledge of the distribution
of the errors and (N2) ensures that the noise is not degenerate. It is noticeable that no
integrability condition is assumed on the errors $\epsilon$. These three assumptions are used for Huber's loss.
For the penalized least squares estimators (e.g. [14] and [31]) we assume that the $\epsilon_i$ are independent
identically distributed random variables with mean 0 and finite variance.

Let $(\hat\alpha^{Hadb}, \hat\beta^{Hadb}, \hat s^{Hadb}, \hat\tau^{Hadb})$ be defined as the minimizer of $Q^{Hadb}(\cdot)$, where $w_j^{adb} = 1/|\hat\beta_j^{unpen}|^{\gamma}$ with
$\gamma > 1/3$ and $\hat\beta^{unpen}$ a root-n-consistent estimator of $\beta^*$ (i.e. $\sqrt{n}(\hat\beta^{unpen} - \beta^*) = O_P(1)$). We denote
$A_n = \{1 \le j \le p,\ \hat\beta_j^{Hadb} \neq 0\}$. Let us remark that if $\lambda_n > 0$, the argminimum $(\hat\alpha^{Hadb}, \hat\beta^{Hadb}, \hat s^{Hadb}, \hat\tau^{Hadb})$
exists since the criterion $Q^{Hadb}(\cdot)$ is a convex and coercive function.

In the following theorem we show that, with a proper choice of $\lambda_n$, the proposed estimator enjoys
the oracle properties. Its proof is postponed to Appendix 5.3.



Theorem 2. Suppose that $\lambda_n/n^{\gamma \wedge 1/2} \to 0$, $\lambda_n n^{(\gamma-1)/2} \to \infty$, $\lambda_n \to \infty$ and $\gamma > 1/3$. Let us also
assume that the conditions $M > 1$, $p_0 > 0$, (N0), (N1), (N2), (D1) and (D2) hold. Moreover,
for $j = 1,\dots,p$, the weights in $Q^{Hadb}$ are $w_j^{adb} = 1/|\hat\beta_j^{unpen}|^{\gamma}$, where $\hat\beta^{unpen}$ is a root-n-consistent
estimator of $\beta^*$. Then, any minimizer $(\hat\alpha^{Hadb}, \hat\beta^{Hadb}, \hat s^{Hadb}, \hat\tau^{Hadb})$ of $Q^{Hadb}$ satisfies the following:

Consistency in variable selection: $P[A_n = A] \to 1$ as $n \to +\infty$.
Asymptotic normality:



n a 



X Tiadb 



A- 



s Hadb - s* 



n 



Ko+z (o,s 2 ) 



where S 2 is the squared block diagonal matrix 



E 



diag 



E[Z 2 



and where 



D s * = [ ( x 2 e 2 l k£ |<Ms*] , A 

'(76 



-Pfkel < Ms* 



z = i + n 



M 



s* n ' M (s*)- 



Analogous results hold for the least squares loss function. In this case ($M = +\infty$), the asymptotic
variance matrix $\mathbb{E}[\mathcal{H}'_{Ms^*}(\sigma\epsilon)^2]\, V_{1,1}^{-1} / (4A_{s^*}^2)$ obtained in Theorem 2 is equal to $\sigma^2 V_{1,1}^{-1}$ and we find the
asymptotic variance of theorem 2 of
4 Some numerical experiments



In this section, we consider both criteria, least squares and Huber's one, combined with the
following penalties: adaptive lasso, ridge, adaptive en and adaptive Berhu. We call these methods
respectively ad-lasso, ridge, ad-en, ad-Berhu, Huber-ad-lasso, Huber-ridge, Huber-ad-en
and Huber-ad-Berhu. The adaptive weights are obtained from the corresponding unpenalized
estimator and $\gamma = 1$.



4.1 Simulation Results 



Here our aim is to compare the finite sample performances of these procedures. Paragraph 4.1.1
presents the studied models. The way the simulations are conducted is described in 4.1.2 and an overview
of the conclusions is provided in paragraph 4.1.3.



4.1.1 Models used for simulations 

The models used to compare the performances of the algorithms are inspired by those presented in
[33]. They involve groups of highly correlated variables: the block-variables model ([33], example
4). Let us remark that [33] considered a model without intercept. We now recall the definition of
this model in a different way. Our formulation makes it possible to clearly identify the groups of influential
correlated variables. The models all have the form $y = 1_n + X\beta^* + \sigma\epsilon$, where $1_n$ denotes the vector
of $\mathbb{R}^n$ composed of ones and y (resp. $\epsilon$) represents the response (resp. error) vector $(y_1,\dots,y_n)^T$
(resp. $(\epsilon_1,\dots,\epsilon_n)^T$). The design matrix X is constructed as follows. The rows of X are given by
n independent Gaussian vectors $\mathcal{N}_{40}(0, \Sigma)$. They are normalized such that the corresponding
p-dimensional covariables are centered (as assumed in (1)). The variance matrix of the variables is
a block diagonal matrix of size 40. The first block is the square matrix of size 5 composed of 1
outside the diagonal and taking the value 1.01 on the diagonal. The second and third blocks are the
same as the first one. The last block is the identity matrix of size 25. The vector of true coefficients
$\beta^*$ is defined as follows: the first 15 coordinates are equal to 3 and the last 25 coefficients are 0. This
means that, in this model, only the first 15 variables influence the response. The 25 others
are pure noise. Amongst the 15 influential variables, there are three groups of highly correlated
variables: these groups are composed of the first five variables, the next five and the last five
ones. The variables of different groups are independent. As compared with (1), this means that
the intercept of the model is $\alpha^* = 1$ and the number of variables (without the intercept) is p = 40.
Depending on the nature of the noise, various models are considered.
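A possible way to build this design in code is sketched below, using only the Python standard library. Within each correlated group, each variable is drawn as a shared standard normal factor plus an independent normal term of variance 0.01, which reproduces the stated block (1 off the diagonal, 1.01 on it); this sampling scheme, the function names and the use of `random.Random` are assumptions of the sketch, not a description of the authors' Matlab code.

```python
import math
import random


def block_covariance(n_groups=3, group_size=5, n_noise=25, diag=1.01):
    """Covariance matrix: n_groups blocks (1 off-diagonal, diag on the
    diagonal) followed by an identity block of size n_noise."""
    p = n_groups * group_size + n_noise
    sigma = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for g in range(n_groups):
        start = g * group_size
        for i in range(start, start + group_size):
            for j in range(start, start + group_size):
                sigma[i][j] = diag if i == j else 1.0
    return sigma


def simulate_design(n, rng, n_groups=3, group_size=5, n_noise=25, diag=1.01):
    """Draw n Gaussian rows with the block covariance, then center the columns."""
    rows = []
    for _ in range(n):
        row = []
        for _ in range(n_groups):
            z = rng.gauss(0.0, 1.0)  # factor shared by the whole group
            row.extend(z + math.sqrt(diag - 1.0) * rng.gauss(0.0, 1.0)
                       for _ in range(group_size))
        row.extend(rng.gauss(0.0, 1.0) for _ in range(n_noise))
        rows.append(row)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


# True coefficients: 15 influential variables followed by 25 noise variables.
beta_star = [3.0] * 15 + [0.0] * 25
```

For example, `simulate_design(100, random.Random(0))` returns a centered 100 x 40 design as a list of rows.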

• Model 1: block-variables model, Gaussian noise. In this case, the standard deviation of the
noise is $\sigma = 15$ and the variables $\epsilon_1, \dots, \epsilon_n$ are independent standard normal variables. Except
for the intercept parameter, this is exactly example 4 of [35].

• Model 2: block-variables model, mixture of Gaussians. In this case, the variables $\epsilon_1, \dots, \epsilon_n$
are independent mixtures of Gaussians. Precisely, with probability 0.9, $\epsilon$ is a standard normal
variable and with probability 0.1, $\epsilon$ is a centered normal with variance 225. The value $\sigma =
3.1009$ has been chosen such that the standard deviation of the noise is the same as in Model
1. The common value is $\mathrm{std}(\sigma\epsilon) = 3.1009\sqrt{1 + 0.1(225 - 1)} = 15$.

• Model 3: block-variables model, double-exponential noise, $\epsilon = D/\sqrt{\mathrm{var}(D)}$ and $\sigma = 10.6$. The
distribution of D is a standard double exponential, i.e. its density is $x \in \mathbb{R} \mapsto e^{-|x|}/2$, and
$\mathrm{var}(D) = 2$.
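The arithmetic behind Model 2 and the two heavy-tailed error distributions can be checked with a few lines of Python. This is an illustrative sketch: the sampler names are hypothetical, and the double exponential variable is generated as the difference of two independent unit exponentials, which has density $e^{-|x|}/2$.

```python
import math
import random


def mixture_noise(rng, p_outlier=0.1, outlier_sd=15.0):
    """Model 2 error: N(0, 1) with probability 0.9, N(0, 225) with probability 0.1."""
    sd = outlier_sd if rng.random() < p_outlier else 1.0
    return rng.gauss(0.0, sd)


def standardized_laplace_noise(rng):
    """Model 3 error: D / sqrt(var(D)) with D standard double exponential (var(D) = 2)."""
    d = rng.expovariate(1.0) - rng.expovariate(1.0)
    return d / math.sqrt(2.0)


# Standard deviation of the mixture and of the scaled noise in Model 2.
mixture_sd = math.sqrt(0.9 * 1.0 + 0.1 * 225.0)   # equals sqrt(1 + 0.1 * (225 - 1))
noise_sd_model2 = 3.1009 * mixture_sd             # approximately 15
```

The variance of the mixture is 0.9 + 0.1 x 225 = 23.4, so multiplying its square root by 3.1009 recovers the standard deviation 15 of Model 1 up to rounding.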

These three models create a grouped variables situation. They allow us to illustrate the grouped
selection ability of the penalties. They can be divided into two types. The first type contains the
light-tailed errors model (1) whereas the second type is composed of the heavy-tailed errors models (2 and
3). Model 1 makes it possible to quantify the deterioration of the performances of the robust algorithms in
the absence of outliers. From the point of view of maximum likelihood, the least squares loss
(resp. Huber's loss) is well designed for Model 1 (resp. Models 2 and 3).

4.1.2 Assessing prediction methods 

To compare the performances of the various algorithms in the fixed design setting, the performances
are measured both by the prediction errors and by the model selection ability. For each considered
underlying model, we generate a first set of n training designs $(x_1, \dots, x_n)$ and a second set of
m = 10 000 test designs $(x_{n+1}, \dots, x_{n+m})$. These two sets are centered in mean to match the
theoretical definition (1) of the model (i.e. this ensures that $\sum_{i=1}^n x_i = 0$). Since the theoretical results
are established in the fixed design framework, the training and test designs are fixed once and for all:
they are used for all the data generations. 100 training sets of size n are generated according
to the definition of the model. All the algorithms have been run on the 100 training sets of size
n = 100, 200, 400 and their prediction capacity has been evaluated on the test design set of size
m = 10 000. To compare the prediction accuracy, the Relative Prediction Errors (RPEs) already
considered in [33] are computed (see also [15] for an explicit definition). Figures 1, 2 and 3 provide
the boxplots associated with the 100 obtained RPEs.

The model selection ability of the algorithms is reported in the same manner as done by [29], [25]
and [5] in Tables 1, 2 and 3. Ridge penalty procedures are not reported since they do not constitute
variable selection procedures. To provide the indicators defined below, a coefficient is considered to
be zero if its absolute value is strictly less than $10^{-5}$ (i.e. its first five decimals vanish). In all cases,
amongst the 100 obtained estimators, the first column (C) counts the number of well chosen models,
i.e. the cases where the first 15 coordinates of $\hat\beta$ are non-zero and the last 25 coefficients are zero. To
go further in the analysis of the model selection ability, we consider other measurements. The first (in the
second column (O)) represents the number of overfitting models (i.e. those selecting all the non-zero
coefficients and at least one zero coefficient). The second (in the third column (U)) reports
the number of chosen underfitting models (i.e. those not selecting at least one non-zero coefficient).
In this way, each of the 100 models is counted exactly once. Columns (O) and (U) aim to explain the
results obtained in (C). The column (Z) is the average number of estimated zeros, the column
(CZ) provides the average number of correctly estimated zeros and (TZ) recalls the theoretical number
of zeros. The column (CNZ) is the average number of correctly estimated non-zeros and (TNZ)
recalls the theoretical number of non-zeros. Model selection abilities are closely related to the accuracy
of the estimations of the coefficients. This fact is illustrated by boxplots of the coefficient estimations
(see Figures 4, 5 and 6).
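The indicators (C), (O), (U), (Z), (CZ) and (CNZ) can be computed for a single fitted coefficient vector as in the following Python sketch. The function name, the dictionary output and the 0-based indexing of the true support are conventions chosen for this illustration; averaging over the 100 replications is left to the caller.

```python
def selection_summary(beta_hat, true_support, tol=1e-5):
    """Classify one estimate as correct (C), overfitting (O) or underfitting (U)
    and count estimated zeros (Z), correct zeros (CZ) and correct non-zeros (CNZ).

    A coefficient is treated as zero when its absolute value is below tol.
    """
    p = len(beta_hat)
    selected = {j for j, b in enumerate(beta_hat) if abs(b) >= tol}
    truth = set(true_support)
    if selected == truth:
        label = "C"
    elif truth <= selected:   # all true variables kept, plus at least one extra
        label = "O"
    else:                     # at least one true variable is missing
        label = "U"
    return {
        "label": label,
        "Z": p - len(selected),
        "CZ": len(set(range(p)) - selected - truth),
        "CNZ": len(selected & truth),
    }
```

In the block-variables model the true support would be `range(15)`, so that (TZ) equals 25 and (TNZ) equals 15.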

Concerning the hyperparameter choices, the regularization parameters associated with the adaptive
lasso or Berhu penalties are chosen by the BIC criterion on each of the 100 training sets, as described
in Section 2.3. The same grid has always been used for each method. It is composed of 100 points
log-linearly spaced between 0 and 1400 for Berhu and 200 points log-linearly spaced between 0 and
10 000 for lasso. For Huber's loss, the simulation studies report the performances obtained with
M = 1.345. This value has been recommended by Huber in [12]. For the adaptive Berhu penalty, we
report the performances obtained with L = M = 1.345. Let us remark that it is possible to choose the
M and L parameters from the data (for example by cross-validation, simultaneously with the tuning
parameter). In practice, however, we did not observe any improvement from making them data-adaptive. For
ridge-type procedures, the hyperparameter is chosen as usual by 5-fold cross-validation on each of
the 100 training sets. The grid is composed of 100 points log-linearly spaced between 0 and 1400.
For en-type procedures, we use a protocol similar to that of [35]: we first pick a relatively small grid of
values for $\lambda_{2,n}$ over {0, 0.01, 0.1, 1, 10, 100} and 25 points log-linearly spaced between 0 and 5000
for $\lambda_{1,n}$. Both parameters are then chosen simultaneously by 5-fold cross-validation.
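
A log-linearly spaced grid of the kind described above can be generated as follows. This is a minimal sketch: the helper name `log_spaced_grid` is hypothetical, and because a logarithmic scale needs a strictly positive lower end, the value `lo = 0.01` used in the example is purely an assumption made for illustration, not a value taken from the paper.

```python
import math

def log_spaced_grid(lo, hi, num):
    """Return `num` points evenly spaced on a log scale between lo and hi."""
    if lo <= 0 or hi <= lo or num < 2:
        raise ValueError("need 0 < lo < hi and num >= 2")
    log_lo = math.log(lo)
    step = (math.log(hi) - log_lo) / (num - 1)
    return [math.exp(log_lo + k * step) for k in range(num)]

# Example with a hypothetical lower end of 0.01 and the upper end 1400 from the text.
grid = log_spaced_grid(0.01, 1400.0, 100)
```

Consecutive grid points have a constant ratio, so small values of the regularization parameter are explored much more finely than large ones.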



4.1.3 Comparison results 

Tables [1], [2] and [3] present the performances in terms of model selection ability. First, we see that
the methods behave in the same way whatever the model. The lasso and en penalty methods
lead in general to underfitting models (columns U). This is surprising for the en penalty. Indeed, the
penalty imposed on the small coefficients is the sum of an $\ell_1$-norm and a squared $\ell_2$-norm. This
implies that the resulting penalty is closer to differentiability than the $\ell_1$-penalty. As shown in PQ,
if the penalty is far from differentiability, more small coefficients are deleted. For these examples,
the en penalty has the same behavior as the lasso one. As a consequence, these penalties have a
relatively high number of zeros with correct zeros number (columns Z) very close to the true one
(columns TZ). But the number of correct non-zeros (columns CNZ) is very low in comparison with the
true one (columns TNZ). The tendency of en and lasso type methods to underfit is reduced for Model 2
and for Huber loss. In all cases, these methods almost never identify the right model. The Berhu
penalty leads to a compromise between overfitting and underfitting. We point out that, contrary to
en and lasso type methods, there is a case where the Berhu type method identifies the right model a
reasonable number of times: Model 2 with Huber loss. It is slightly worse in terms of correct
zeros but much better in terms of the number of non-zeros.

This behavior is reflected in the quality of estimation of the non-zero coefficients (see Figures [4], [5]
and [6]). Let us note that we have only considered the first coefficient $\beta_1$ and that the conclusions
for the other non-zero coefficients are the same. The ridge method is given here as a reference since
it is known to perform well in the presence of high correlation between the covariables. We
observe that the Berhu penalty leads to good performance in terms of bias, like ridge, with higher
variability than ridge. The bias and sometimes the variability are very high for the other
methods due to their tendency to underfit.

Figures [1], [2] and [3] provide the boxplots associated with RPE. As expected, the ordinary least
squares loss leads to better performance for Model 1 (except for n = 400) and to worse
performance for Model 3 and especially for Model 2. We observe that ad-Berhu and
Huber-ad-lasso produce several extreme values due to numerical instabilities and are often more
variable.

4.2 Prostate cancer data example

This data set comes from a prostate cancer study (see [21]) and was analyzed earlier in the elastic
net paper by [331 E]. There are eight clinical covariates, namely: logarithm of the cancer volume
(lcavol), logarithm of the prostate weight (lweight), age, the logarithm of the amount of benign
prostatic hyperplasia (lbph), seminal vesicle invasion (svi), logarithm of the capsular penetration
(lcp), Gleason score (gleason) and percentage Gleason score 4 or 5 (pgg45). The response is the
logarithm of prostate-specific antigen (lpsa). The predictors are named 1, ..., 8 in the results.
OLS and the previous methods were applied to these data.

In [35], the data were divided into two parts: a training set with 67 observations and a test set
with 30 observations, while in [6] the original data set was divided (randomly) into training
and testing sets containing 60 and 37 observations respectively. To compare the methods fairly, we
propose to perform a resampling study: we have divided the original data set 100 times (randomly)
into training and testing sets containing 67 and 30 observations respectively. The hyperparameters
are chosen as in the simulation study. We then compared the performances of the methods by
computing their RPE on the 100 resampled testing sets (see Table [4]). Contrary to what had been
observed in [35] and [6], our resampling study does not allow us to claim that one method stands out in terms
of RPE: almost all these methods have similar RPE. We can only say that Huber-ad-lasso
is perhaps slightly worse. Let us notice that we observe a great variability in the choice of $\lambda_{2,n}$ for the
adaptive en-type procedures (see first column of Table [4]). This is also the case for Huber-ad-lasso.
On the contrary, the choice of $\lambda_n$ for Berhu type procedures is more stable (it is comparable to the
stability of ridge). Figure [7] shows (except for OLS and ridge procedures) the histogram associated
with the selected variables. We see that Berhu penalties lead to good models in terms of sparsity
in comparison with en penalties. We observe that Berhu type procedures are a compromise between
lasso type methods, which select too few variables, and en type methods, which select too many
variables.
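
The resampling protocol (100 random splits of the 97 observations into 67 training and 30 testing observations) can be sketched as below. The function name `repeated_splits` and the fixed seed are assumptions made for this illustration; the paper does not say how the random splits were drawn.

```python
import random

def repeated_splits(n_obs, n_train, n_repeats, seed=0):
    """Draw `n_repeats` random train/test partitions of range(n_obs)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        splits.append((sorted(idx[:n_train]), sorted(idx[n_train:])))
    return splits

# 97 prostate observations, 67 for training and 30 for testing, 100 repetitions.
splits = repeated_splits(97, 67, 100)
```

Each method would then be fitted on the training indices of a split and its prediction error evaluated on the matching test indices, giving 100 error values per method.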

5 Appendix 

5.1 Computations: software used for numerical optimization 

When the regularization parameter is fixed, to solve all the involved optimization problems we used
CVX, a package for specifying and solving convex programs [7], [8]. CVX is a set of Matlab functions
using the methodology of disciplined convex programming. Disciplined convex programming imposes
a limited set of conventions or rules, which are called the DCP ruleset. Problems which adhere
to the ruleset can be rapidly and automatically verified as convex and converted to solvable form.
Problems that violate the ruleset are rejected, even when convexity of the problem is obvious to
the user. The version of CVX we use is a preprocessor for the convex optimization solver SeDuMi
(Self-Dual-Minimization [23]).

Let us now recall a well-known fact of convex analysis: the Huber function is the Moreau-Yosida
regularization of the absolute value function ([10], [20], [21]). Precisely, it can be easily shown that
the Huber function satisfies

$$H_M(z) = \min_{v \in \mathbb{R}} \left( (z - v)^2 + 2M|v| \right).$$
We can derive the same kind of formulation for the BerHu function, leading to a characterization of
the BerHu function as a quadratic optimization problem. Indeed, the function $B_L$ satisfies

$$B_L(z) = \min_{w \ge L \vee |z|} \left( \frac{w^2}{2L} - w + |z| + \frac{L}{2} \right),$$

where $a \vee b$ denotes the maximum of the two real numbers a and b. The proof of this equality is
straightforward since it amounts to minimizing a quadratic function on an interval.
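
Both variational formulas can be checked numerically. The sketch below assumes the usual piecewise definitions, $H_M(z) = z^2$ for $|z| \le M$ and $2M|z| - M^2$ otherwise, and $B_L(z) = |z|$ for $|z| \le L$ and $(z^2 + L^2)/(2L)$ otherwise; these closed forms are not restated in this appendix, so they are an assumption here. The helper names are hypothetical, and the minimization is a plain grid search rather than the CVX formulation.

```python
def huber(z, M):
    """Huber function (assumed piecewise form)."""
    return z * z if abs(z) <= M else 2 * M * abs(z) - M * M

def berhu(z, L):
    """BerHu (reversed Huber) function (assumed piecewise form)."""
    return abs(z) if abs(z) <= L else (z * z + L * L) / (2 * L)

def grid_min(f, lo, hi, num=20001):
    """Brute-force minimum of f over an evenly spaced grid on [lo, hi]."""
    step = (hi - lo) / (num - 1)
    return min(f(lo + k * step) for k in range(num))

def huber_by_min(z, M):
    """min over v of (z - v)^2 + 2 M |v|, searched on [-(|z|+1), |z|+1]."""
    r = abs(z) + 1.0
    return grid_min(lambda v: (z - v) ** 2 + 2 * M * abs(v), -r, r)

def berhu_by_min(z, L):
    """min over w >= max(L, |z|) of w^2/(2L) - w + |z| + L/2."""
    lo = max(L, abs(z))
    return grid_min(lambda w: w * w / (2 * L) - w + abs(z) + L / 2, lo, lo + 10.0)
```

For the BerHu case the quadratic in w has its unconstrained minimum at w = L, so the constrained minimum always sits at the left end of the interval, which is why the identity reduces to minimizing a quadratic on an interval.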

This allows us to write our optimization problem in a form that CVX accepts. Note that
[19] uses an expression of $H_M(z)$ as the solution of a quadratic optimization problem (borrowed
from the user guide of CVX) for the same purpose. However, the
expression of [19] involves more constraints and more variables than the previous formulation. We
give here the way to use CVX in order to compute the estimators alpha = $\hat\alpha^{Hadb}$, beta = $\hat\beta^{Hadb}$ and
s = $\hat s^{Hadb}$. The variable X represents the design matrix X. The unpenalized estimator betaUNP = $\hat\beta^{unpen}$
is calculated beforehand (also using CVX) and the regularization parameter $\lambda_n$ is fixed and denoted
by lambda.

cvx_begin
    % v and w are the auxiliary variables of the Huber and BerHu formulations
    variables alpha beta(p) s v(n) tau w(p);
    % Huber loss with concomitant scale s, plus the adaptive BerHu penalty
    minimize( n*s + quad_over_lin(y - alpha - X*beta - v, s) + 2*M*norm(v, 1) ...
        + lambda*( tau*norm(betaUNP, 1) + quad_over_lin(w./sqrt(abs(betaUNP)), 2*L*tau) ...
        + norm(beta./betaUNP, 1) - sum(w./abs(betaUNP)) + 0.5*L*tau*norm(1./betaUNP, 1) ) )
    subject to
        s > 0;
        tau > 0;
        w >= L*tau;
        w >= abs(beta);
cvx_end

Let us remark that betaUNP is computed in the same way but deleting the term multiplied by 
lambda. 
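
The penalty terms of the CVX objective can be traced back to the variational formula for the BerHu function. The short derivation below is a sketch: it assumes $\tau > 0$ and writes $w'$ for the auxiliary variable of the formula for $B_L$, with the change of variable $w_j = \tau w'$.

```latex
\begin{align*}
\tau\, B_L\!\left(\frac{\beta_j}{\tau}\right)
  &= \tau \min_{w' \ge L \vee |\beta_j|/\tau}
     \left( \frac{w'^2}{2L} - w' + \frac{|\beta_j|}{\tau} + \frac{L}{2} \right) \\
  &= \min_{w_j \ge L\tau \vee |\beta_j|}
     \left( \frac{w_j^2}{2L\tau} - w_j + |\beta_j| + \frac{L\tau}{2} \right).
\end{align*}
```

Dividing each such term by $|\hat\beta_j^{unpen}|$ (the weights visible in the listing) and summing over j gives, in order, the terms quad_over_lin(w./sqrt(abs(betaUNP)), 2*L*tau), -sum(w./abs(betaUNP)), norm(beta./betaUNP, 1) and 0.5*L*tau*norm(1./betaUNP, 1), while the constraints w >= L*tau and w >= abs(beta) encode $w_j \ge L\tau \vee |\beta_j|$.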

5.2 Proof of Theorem [1]

Since (3f ^ 0, we have (3 adb ^ and f adb > 0. Consequently, the definition of partial derivatives 
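involving Newton's quotient leads to the following KKT conditions by differentiating with respect
to $\beta_i$, $\beta_j$ and $\tau$:

$$-2x_i^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right) + \lambda_n \hat w_i^{adb} B_L'\!\left(\frac{\hat\beta_i^{adb}}{\hat\tau^{adb}}\right) = 0, \qquad (10)$$

$$-2x_j^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right) + \lambda_n \hat w_j^{adb} B_L'\!\left(\frac{\hat\beta_j^{adb}}{\hat\tau^{adb}}\right) = 0, \qquad (11)$$

$$\sum_{j:\,|\hat\beta_j^{adb}| > L\hat\tau^{adb}} \hat w_j^{adb}\left(\frac{1}{2L}\left(\frac{\hat\beta_j^{adb}}{\hat\tau^{adb}}\right)^2 - \frac{L}{2}\right) = \sum_{j=1}^{p} \frac{1}{\hat w_j^{adb}}.$$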
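The last score equation implies that the set $G = \{j \in [1;p],\ |\hat\beta_j^{adb}| > L\hat\tau^{adb}\}$ is non-empty. Let us
now distinguish some cases involving this set of indices. To begin with, if both the indices i and j
belong to G, equations (10) and (11) become

$$-2x_i^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right) + \lambda_n \hat w_i^{adb}\frac{\hat\beta_i^{adb}}{L\hat\tau^{adb}} = 0 \qquad (12)$$

and

$$-2x_j^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right) + \lambda_n \hat w_j^{adb}\frac{\hat\beta_j^{adb}}{L\hat\tau^{adb}} = 0.$$

Subtracting the second one from the first one and using the Cauchy-Schwarz inequality, we get:

$$\left|\hat w_i^{adb}\hat\beta_i^{adb} - \hat w_j^{adb}\hat\beta_j^{adb}\right| \le \frac{2L\hat\tau^{adb}}{\lambda_n}\,\|x_i - x_j\|_2\,\left\|y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right\|_2.$$

The definition of $(\hat\alpha^{adb}, \hat\beta^{adb}, \hat\tau^{adb})$ as a minimizer implies that, for all $\tau > 0$,

$$\left\|y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right\|_2^2 \le \|y\|_2^2 + \lambda_n \tau \sum_{i=1}^{p}\frac{1}{\hat w_i^{adb}}.$$

Now, letting $\tau$ tend to 0 in this inequality, we get:

$$\left\|y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right\|_2 \le \|y\|_2. \qquad (13)$$

This leads to equation (7) of Theorem [1] since $C_{ij} = 1$ in this case.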

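Next, let us consider the case where only one index among {i, j} belongs to G. If i and j are
switched (if necessary), we can suppose that $i \in G$ and $j \notin G$. In this case, equations (10) and (11)
become (12) and

$$-2x_j^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right) + \lambda_n \hat w_j^{adb}\,\mathrm{sign}(\hat\beta_j^{adb}) = 0.$$

These two equalities lead to

$$\hat w_i^{adb}\hat\beta_i^{adb} - \hat w_j^{adb}\hat\beta_j^{adb} = \frac{2L\hat\tau^{adb}}{\lambda_n}\left(x_i - \frac{|\hat\beta_j^{adb}|}{L\hat\tau^{adb}}\,x_j\right)^T\left(y - \hat\alpha^{adb}\mathbf{1}_n - X\hat\beta^{adb}\right).$$

Combining the Cauchy-Schwarz inequality and inequality (13), this leads to

$$\left|\hat w_i^{adb}\hat\beta_i^{adb} - \hat w_j^{adb}\hat\beta_j^{adb}\right| \le \frac{2L\hat\tau^{adb}}{\lambda_n}\,\|y\|_2\,\sqrt{\|x_i\|_2^2 + \|x_j\|_2^2 - 2\,\frac{|\hat\beta_j^{adb}|}{L\hat\tau^{adb}}\,x_i^T x_j},$$

where we have used $j \notin G$. This implies equation (7) of Theorem [1] since $C_{ij} = |\hat\beta_j^{adb}|/(L\hat\tau^{adb})$ in
this case.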

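Finally, when i and j do not belong to G, using similar arguments we obtain

$$\left|\hat w_i^{adb}\hat\beta_i^{adb} - \hat w_j^{adb}\hat\beta_j^{adb}\right| \le \frac{2L\hat\tau^{adb}}{\lambda_n}\,\|y\|_2\,\sqrt{\|x_i\|_2^2 + \|x_j\|_2^2 - 2\,\frac{|\hat\beta_i^{adb}\hat\beta_j^{adb}|}{L^2(\hat\tau^{adb})^2}\,x_i^T x_j},$$

which implies equation (7) of Theorem [1] since $C_{ij} = |\hat\beta_i^{adb}\hat\beta_j^{adb}|/(L^2(\hat\tau^{adb})^2)$ in this case.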
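
The three cases of the proof assign a different value to the constant $C_{ij}$. The sketch below collects them in one function; the name `grouping_constant` and its argument names are hypothetical, and the inputs stand for the fitted values $\hat\beta_i^{adb}$, $\hat\beta_j^{adb}$, $\hat\tau^{adb}$ and the BerHu parameter L.

```python
def grouping_constant(beta_i, beta_j, tau, L):
    """Return C_ij according to membership of i and j in G = {k : |beta_k| > L * tau}."""
    in_g_i = abs(beta_i) > L * tau
    in_g_j = abs(beta_j) > L * tau
    if in_g_i and in_g_j:
        return 1.0                              # both coefficients in the quadratic zone
    if in_g_i or in_g_j:
        small = beta_j if in_g_i else beta_i    # the coefficient outside G
        return abs(small) / (L * tau)
    return abs(beta_i * beta_j) / (L * tau) ** 2  # both in the l1 zone
```

In every case the returned value lies in [0, 1], and it equals 1 only when both coefficients are large enough to be penalized quadratically.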
5.3 Proof of Theorem H



The asymptotic normality of this estimator is proved in Step 1 and the consistency in variable
selection in Step 2. This proof is an adaptation to our case of the proof given by [34] or [15].
The difference with [15] concerns the treatment of the penalty term. So, in the following, we
use notations similar to those of [15] and point out the differences between the two proofs.
Step 1. Let us first prove the asymptotic normality. Let us define U n (u) = 



Q Hadb ((a*,(3*,s*,T*Y +u/y/n) - Q Hadb (a* , (3* 
U„(u) is minimized at 



s*, t*) with u 



P+2j 



E W+ 3 . Obviously, 



u 



(a) 



n a 



lHadb 



- a* J Hadb - (3* , s Haab - s 



Hadb 



n ( ±.Hadb 



n 



The principle of the proof of [31] or [15] is to study the epi-limit of U n . Using the proof of theorem 
3.2 in [15], we only need to study the epi-limit of the penalty term given by 



Pju) = A„ P 



yadb 



n 



P ad \/3*,r*) 



where P adb ((3,T) = P adb (f3,T), if r > 0, oo if r < 0. The epi-limit of this term is given in the 
Lemma [T] . This lemma together with lemma 2 of [15] indicates that U n — > e _d U, where U (u) = 
A s * (uJ.pVui.p + Uq) + D s *u 2 p+l — W T u + Up + 2 2 C(up + 2), if Uj = 0, Vj ^ A, +oo otherwise. Under 
condition (5* ^ 0, equation (25 ) in Lemma 2limplies that Y^jLi \ | 2_7 1|/3*|>Lt* > thus the function 
z — > z 2 C(z) is strictly convex. Moreover, Vx,i is supposed positive definite in assumption (D2) and 
we assume that the noise satisfies (N2). Consequently, U get a unique argmin and the asymptotic 
normality part is proved. 

Step 2. Let us now show the consistency in variable selection part. It suffices to show that
$P[A \subset A_n] \to 1$ as n tends to infinity and $P[A^c \subset A_n^c] \to 1$ as n tends to infinity. The first claim
is an easy consequence of the asymptotic normality obtained in Step 1.

Let us now show the second claim. Let j such that (3* = 0. We have to prove that P (3j iadb ^ 
as n tends to infinity. As in [15j, we have for a such j, 



P 



P 



^Uadb ^ Q 



3 



-.Hadb 



< P ^s Hadb ,f Hadb ) = (0,0)] + 



> and s Hadb > and ^ x itj H' M 



a 



Hadb 



T RHadb 



ZHadb 



i=l 



Using similar arguments as in 



we have, as n tends to infinity, 



Since Vs 6 
P 



p[C 

( x ) I > 1) we have 



^Hadb ±."}iadb 
5 i / 



(0,0)] ^0. 



-Hadb 



-Hadb 



< P 



and 2j x ij^M 



a 



Hadb 



T pHadb 



x/3 



gHadb 



s Hadb > and 



-\ n wf adb B' L 



-\ n wf adb B' L 



Vi-OL 



Hadb 



T oHadb 



nHadb 



i=l 



> 



Hadb 
J_ 
-Hadb 



P 



T 



Hadb 
3_ 

Hadb 



A 



n ^Hadb 

n 1 
As in [15], we have



1 n 

i=l 



zHadb 



Op(1), 



and $\sqrt{n}/(\lambda_n \hat w_j^{Hadb}) \xrightarrow{P} 0$, which implies that $P[\hat\beta_j^{Hadb} \neq 0] \to 0$ as n tends to infinity.



5.4 Technical lemma 
5.4.1 Proof of Lemma [1]

Lemma 1. Suppose that A n /n 7A1/2 -)■ 0, A n n (7-1)/2 ->• oo, A n — >• oo, A n > 1/3 and /?* ^ 0. TTjen 
we /iai>e 

u p+2 2 C(u p+2 ) ifUj = 0, Vj A 
+oo otherwise , 

w/iere 



1 Po T 1-7 



P+2J 



2Lr 



3=1 



2 r *( 7 +i) 



Since /3* 7^ 0, Lemma ensures that r* > 0. Consequently, we have P n (u) = Y7j=\ Pn,j(u), where 



\ [ u p+ 2 i_ rfjadb ( * I \ K T ( ^* + v^ 



r*u)f b B L ( £ ) ) if Mp+2 > -VKr*, 
if ii„ +2 = -V%Tr*, 



-00 



lp+2 

and Mj = —y/n(3*, 
otherwise. 



Step 1. First let us prove that 



^2 P n,j(u) ^e-d U 2 p+2 C(u p+2 ) . 
3=1 



(14) 



We show that, for every u fixed in $\mathbb{R}^{p+2}$, we have this convergence in probability. Since $\tau^* > 0$ and
$\lambda_n \to +\infty$ as n tends to infinity, for n sufficiently large (with respect to a bound depending on
$u_{p+2}$), $u_{p+2}/\sqrt{\lambda_n} + \tau^* > 0$ and



w 



where 



Vj G [l,po], Gj : (z 1 ,z 2 ) -> (z 2 + r*)£ L 
For $1 \le j \le p_0$ such that $|\beta_j^*| \neq L\tau^*$, $G_j$ is twice differentiable at 0 and the Taylor-Young
theorem entails that, $\forall (z_1, z_2) \in \mathbb{R}^2$,

G j (z 1 ,z 2 )^G j (0)+z 1 B' L (^]+z 2 B (^\ + A-\ m>I/n 



2Lt* 

f 2Lt* 3 Wj> LT * ~ Lr* 2 ^> Lt * + *( Zl > Z2 > 



where $\varepsilon(z_1, z_2)/\|(z_1, z_2)\|_2^2 \to 0$ as $(z_1, z_2) \to 0$, $B : z \in \mathbb{R} \mapsto B_L(z) - zB_L'(z)$ and we have used that
$B_L''(\beta_j^*/\tau^*) = 1_{|\beta_j^*| > L\tau^*}/L$. Consequently, for $1 \le j \le p_0$ such that $|\beta_j^*| \neq L\tau^*$,

PnM = + v^ n u P+2 wf b B + %P^1|,;|>^ + V» , (15) 



3 

where 



a n ,j{u) 



71 



Let us now consider $1 \le j \le p_0$ such that $|\beta_j^*| = L\tau^*$. When $\beta_j^* = L\tau^*$, for n sufficiently large
(with respect to a bound depending on u),



f) : , , v - W"p+2 , 7 adb II*, M P+2 \ n I T * v 7 ™ I ; _* 



'X n U 

^.adb ' ■■" — 3 \ \ ' A 1 ~" 1 — • ■ _* 1 



Let us consider n sufficiently large (with respect to a bound depending on u) such that $L\tau^* + u_j/\sqrt{n} > 0$
and $\tau^* + u_{p+2}/\sqrt{\lambda_n} > 0$. It is possible since $\tau^* > 0$. Thus, combined with the assumption $\lambda_n \to +\infty$
as n tends to $\infty$, the involved sequence tends to a strictly positive limit as n tends to $\infty$. Since
$\lambda_n/n \to 0$ as n tends to $\infty$, two cases are possible. Either $\sqrt{\lambda_n/n}\,u_j \le Lu_{p+2}$ and

b n>j {u) = P ntj {u) "^J— = (16) 



or a/ X n /nUj > Lu p+2 and 



\ „7,adb / „,2 * \ T ",adb 



M«) = , '" J x Uyr + + s<^- ( 17 ) 



Similarly, we get the same result if $\beta_j^* = -L\tau^*$. Gathering (15) and using $B(\pm L) = 0$, we have the
following decomposition:



Po Po Po 2 P° 

i=i j=i i=i j=i 

(18) 
where



(n) = Mp^yx^y^ 



+ ^5 ' Pj 



\adb 



We now study the convergence of each term. The $\sqrt{n}$-consistency of $\hat\beta^{unpen}$ implies that $\hat w_j^{adb} \xrightarrow{P}
1/|\beta_j^*|^{\gamma} < +\infty$. Moreover, $\lambda_n/\sqrt{n} \to 0$ as n tends to infinity, thus, by Slutsky's theorem, the first
three terms of $a_{n,j}(u)$ tend to 0 in probability for any $u \in \mathbb{R}^{p+2}$ fixed. Concerning the last term
(the remainder), we have that



Ve > 0,3N e (u), Wn > N e (u), A n £ ( ^g) < e 



n 



7/2 \ 

3 n 2 

— + U P+ 2 



Moreover, $(\lambda_n/n)_{n \ge 1}$ is a bounded sequence (since it converges to 0 as n tends to infinity). Thus,
$\lambda_n\,\varepsilon(u_j/\sqrt{n}, u_{p+2}/\sqrt{\lambda_n}) \to 0$ as n tends to $\infty$. Consequently, for any $u \in \mathbb{R}^{p+2}$ fixed, the fourth
term of $a_{n,j}$ tends to 0 in probability. Using Slutsky's lemma, this entails that, for any $u \in \mathbb{R}^{p+2}$
fixed, $a_{n,j}(u) \xrightarrow{P} 0$. Concerning the term $b_{n,j}(u)$: as previously, we have $\hat w_j^{adb} \xrightarrow{P} 1/|\beta_j^*|^{\gamma} < +\infty$ and
$\lambda_n/\sqrt{n} \to 0$ as n tends to infinity, so, if $\beta_j^* = L\tau^*$,

h f ^ L(1 " 7) 2 , 

On,j{U) -> 2 ^ (7+1) M p+2 iL Up+2<0 . 



Similarly, we get the same result if $\beta_j^* = -L\tau^*$. Concerning the term $c_{n,j}(u)$, Property (25) (see
Lemma [2]) is available since $\beta^* \neq 0$ and



C n ,j(u) — U p+ 2 



A, 




PO 




|^unpen| 7 _ |^| 7 




i 1 \ ' \ './ 

Since $\beta_j^* \neq 0$, $x \mapsto |x|^{\gamma}$ is differentiable at $\beta_j^*$ and the Taylor-Young theorem entails that



(19) 



n(r ? T- l/?fl 7 ) =7sign(/3*)|/3*r- 1 ^(/3r Pf " 



junpen 



"" / " n implies that the first term of this 

unpen \ 



with — >■ as 2 tends to /3*. Now, the ^/n-consistency of $ 

expansion is bounded in probability. It also entails that ^ npen JL> ft* which leads to £j f ^J"^"! Ji> 
since — ► as 2 tends to Consequently, the second term of this expansion is also bounded 
in probability and, finally, $\sqrt{n}\left(|\hat\beta_j^{unpen}|^{\gamma} - |\beta_j^*|^{\gamma}\right) = O_P(1)$. Since $\lambda_n/n \to 0$ as n tends to infinity,
and $|\hat\beta_j^{unpen}|^{\gamma} \xrightarrow{P} |\beta_j^*|^{\gamma} \neq 0$, $c_{n,j}(u)$ converges in probability to 0. Combining (18) with all these
convergences, the convergence in probability of (14) is proved. Using first theorem 2.7 (vi) of [26]
and then the fact that convergence in probability is stronger than convergence in distribution (theorem 2.7
(ii) of [26]), we get that convergence in probability implies finite-dimensional convergence in (14).
Theorem 5 of [13] implies that (14) holds since the limit function $u \mapsto u_{p+2}^2 C(u_{p+2})$ is finite.

Step 2. Next, we treat the sum of terms $P_{n,j}$ for $j > p_0$, and first show that



PO + 1' ' ™,P- 



B 



PO + l' 



Ib p ) 



(20) 
where $B_j = \{(u_{1:p}, u_{p+2}) \in \mathbb{R}^{p+1},\ u_j = 0\}$ and for a set A, $I_A$ denotes the indicator function of A
(i.e. $I_A(x) = 0$ if $x \in A$ and $I_A(x) = +\infty$ otherwise). Let us put

$$q_{n,j}(u) = P_{n,j}(u) - \sqrt{\lambda_n}\,u_{p+2}\,|\hat\beta_j^{unpen}|^{\gamma}. \qquad (21)$$

Since $\hat\beta^{unpen}$ is a $\sqrt{n}$-consistent estimator and $j \in [p_0+1, p]$, $n^{1/2}\hat\beta_j^{unpen}$ is a tight sequence.
Moreover, we have $\lambda_n/n^{\gamma} \to 0$ as n tends to infinity, thus $\forall u_{p+2} \in \mathbb{R}$, $\sqrt{\lambda_n}\,u_{p+2}|\hat\beta_j^{unpen}|^{\gamma} =
u_{p+2}\sqrt{\lambda_n n^{-\gamma}}\,(\sqrt{n}|\hat\beta_j^{unpen}|)^{\gamma} \xrightarrow{P} 0$. Using first theorem 2.7 (vi) of [26], we get that convergence in
probability implies finite-dimensional convergence: $\sqrt{\lambda_n}\,u_{p+2}|\hat\beta_j^{unpen}|^{\gamma} \xrightarrow{f-d} 0$. Since the involved
limit function is finite and by convexity, theorem 5 of [13] ensures that we have the epi-convergence
in distribution. Moreover, $B_L(x) \ge |x|$ and $B_L(0) = 0$, so Lemma [3] with $q(x) = B_L(x)$ leads to



d(q n>j ,I Bj )<T 



An, 



where d is defined as in (26). We have A n — » +00 as n tends to infinity and 2 t T v/ ^"] +1 — y as 
n tends to infinity since r* > 0. Furthemore 2^fn\^ nven V l^n = 2(y/n\/3j npen \)' 1 /Xn/n^' 1 ^ 2 and 
since (] un P en is a A/n-consistent estimator and j G [po + l,p], the numerator is a tight sequence 
and the denominator tends to +00 as n tends to infinity. Consequently, 2y / n|/3j| 7 /A„ —> and 



$d(q_{n,j}, I_{B_j}) \to 0$. Finally, using part (ii) of lemma 1.10.2 page 57 of [27], we have $q_{n,j} \xrightarrow{e-d} I_{B_j}$.
The notion of epi-convergence in distribution of convex lower semicontinuous random variables is a
particular case of weak convergence of a net as stated in definition 1.33 of [27]. Consequently, we
can use Slutsky's theorem page 32, example 1.4.7 of [27] to ensure that



Ul:p, Up+2, 



e—d 



(22) 



since is deterministic. Moreover, we have y/\^ l u p+2 \(3^ npen \' 1 — > u -d since we have shown the 



finite dimensional convergence in distribution and since y A n ti p+ 2|/3j| 7 and are finite convex func- 
tions ([2] and [13]). We are now in position to use part (b) of theorem 4 of [13]: gathering (22), 
\/A n M„ + 2|/3"™ pen | 7 — > u -d 0, continuity of and (21), it ensures that P n j -^r e -d Ib^ holds. Since J Bj . 



is deterministic, theorem 18.10 (ii) of [26] ensures that the convergence in probability holds. Now,
theorem 18.10 (vi) of [26] leads to the convergence in probability in (20). Moreover, convergence in
probability is stronger than convergence in distribution, thus (20) is proved.

For all $I \subset [p_0+1, p]$, $\mathrm{dom}\left(\sum_{i \in I} I_{B_i}\right) = \{(u_{1:p}, u_{p+2}) \in \mathbb{R}^{p+1},\ u_i = 0,\ \forall i \in I\}$. Thus, for all
$I \subset [p_0+1, p]$ and $J \subset [p_0+1, p]$ satisfying $I \cap J = \emptyset$,

$$0 \in \mathrm{int}\left(\mathrm{dom}\Big(\sum_{i \in I} I_{B_i}\Big) - \mathrm{dom}\Big(\sum_{j \in J} I_{B_j}\Big)\right),$$

where for f, a function defined on $\mathbb{R}^{p+1}$, $\mathrm{dom}(f) = \{x \in \mathbb{R}^{p+1} / f(x) < +\infty\}$ and $A - B = \{a - b,\ a \in
A,\ b \in B\}$. Using successively this fact, (20), Theorem 5 of [17] and theorem 18.10 (iii) (v) (vi) and
18.11 of [26], we get

$$\sum_{j=p_0+1}^{p} P_{n,j} \xrightarrow{e-d} \sum_{i=p_0+1}^{p} I_{B_i}. \qquad (23)$$
As previously, we can use Slutsky's theorem page 32, example 1.4.7 of [27] to ensure that (23) and
(14) imply that

$$\left(\sum_{j=p_0+1}^{p} P_{n,j},\ \sum_{j=1}^{p_0} P_{n,j}\right) \xrightarrow{e-d} \left(\sum_{i=p_0+1}^{p} I_{B_i},\ u_{p+2}^2 C(u_{p+2})\right) \qquad (24)$$

since $u_{p+2}^2 C(u_{p+2})$ is deterministic. Moreover, we have $\sum_{j=1}^{p_0} P_{n,j}(u) \xrightarrow{u-d} u_{p+2}^2 C(u_{p+2})$ since we
have shown the finite-dimensional convergence in distribution and $\sum_{j=1}^{p_0} P_{n,j}(u)$ and $u_{p+2}^2 C(u_{p+2})$
are finite (for n sufficiently large) convex functions ([2] and [13]). Using part (b) of theorem 4 of
[13]: gathering (24), $\sum_{j=1}^{p_0} P_{n,j}(u) \xrightarrow{u-d} u_{p+2}^2 C(u_{p+2})$ and continuity of $u_{p+2}^2 C(u_{p+2})$, it
ensures that Lemma [1] holds. ■



5.4.2 Proof of lemma [2] 

Lemma 2. If (3* ^ then there exists a unique r* > satisfying equation M) and 




Proof. Let us denote by $I$ the following function of $\tau$




This function is convex and $I'(\cdot)$ is continuous, increasing with $I'(\tau) \to \sum_{j=1}^{p} |\beta_j^*|^{\gamma}$ as $\tau \to +\infty$ and,
if $\beta^* \neq 0$, $I'(\tau) \to -\infty$ as $\tau \to 0$. This leads to the existence of $\tau^* > 0$ by the intermediate value
theorem. The minimum of $I$ is unique since $I'$ is strictly increasing on each of the pieces $]0, |\beta_{(1)}^*|/L[$ and
$[|\beta_{(k)}^*|/L, |\beta_{(k+1)}^*|/L[$ for $1 \le k \le p-1$, continuous and increasing on $\mathbb{R}_+^*$, and strictly positive at $|\beta_{(p)}^*|/L$
since $I'(|\beta_{(p)}^*|/L) = \sum_{j=1}^{p} |\beta_j^*|^{\gamma} > 0$. Note that $I'$ is constant on $[|\beta_{(p)}^*|/L, +\infty[$. This concludes the
proof. ■
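
The existence argument is constructive enough to be turned into a bisection. The sketch below is only an illustration under a loudly stated assumption: the displayed formula for the function is not legible in this copy, so the derivative used here, $\sum_j |\beta_j|^{\gamma} - \sum_{j:\,|\beta_j| > L\tau} |\beta_j|^{-\gamma}\left(\beta_j^2/(2L\tau^2) - L/2\right)$, is a reconstruction chosen to match the properties stated in the proof (limit $\sum_j |\beta_j|^{\gamma}$ for large $\tau$, limit $-\infty$ at 0, constant beyond the largest $|\beta_j|/L$). The function names are hypothetical and all entries of `beta` are assumed non-zero.

```python
def score_tau(tau, beta, L, gamma):
    """Assumed form of the derivative in tau (a reconstruction, see lead-in)."""
    total = sum(abs(b) ** gamma for b in beta)
    for b in beta:
        if abs(b) > L * tau:  # only coefficients in the quadratic zone contribute
            total -= abs(b) ** (-gamma) * (b * b / (2 * L * tau * tau) - L / 2)
    return total

def solve_tau(beta, L=1.345, gamma=1.0, tol=1e-12):
    """Find the unique zero of score_tau by bisection (entries of beta non-zero)."""
    hi = max(abs(b) for b in beta) / L   # the score is strictly positive here
    lo = hi
    while score_tau(lo, beta, L, gamma) >= 0:
        lo /= 2                          # the score tends to -infinity near 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score_tau(mid, beta, L, gamma) < 0:
            lo = mid
        else:
            hi = mid
    return hi
```

With a single coefficient b = 2, L = 1 and gamma = 1, the assumed equation can be solved by hand and gives tau = 2/3, which the bisection reproduces.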



5.4.3 Proof of Lemma [3]



For f, a function defined on S, we denote by epi(f) its epigraph, given by $\mathrm{epi}(f) = \{(x, t) \in S \times \mathbb{R} / f(x) \le
t\}$.

Lemma 3. Let q be a function such that $q(0) = 0$ and $\forall x \in \mathbb{R}$, $q(x) \ge |x|$. We use the notations
of the proof of Lemma [1]. Let us recall that $q_{n,j}(u_{1:p}, u_{p+2}) = P_{n,j}(u_{1:p}, u_{p+2}) - \sqrt{\lambda_n}\,u_{p+2}|\hat\beta_j^{unpen}|^{\gamma}$ where


' (^giAr + & + T ") 9 (If 7 ) - (? 



P n ,j{u) = < 



if U p+2 = -y/\^T*, 

and^j = —\fnfi. 



3 ' ' 



otherwise . 
Then, $\forall j \in [p_0+1, p]$,
where 

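$$d(q_{n,j}, I_{B_j}) = \sum_{k=1}^{+\infty} \frac{1 \wedge d_k(\mathrm{epi}(q_{n,j}), \mathrm{epi}(I_{B_j}))}{2^k}, \qquad (26)$$

$d_k$ is a semi-distance ("constrained Pompeiu-Hausdorff distance")

$$d_k(\mathrm{epi}(q_{n,j}), \mathrm{epi}(I_{B_j})) = \max_{\|x\|_2 \le k} \left| d_{\mathrm{epi}(q_{n,j})}(x) - d_{\mathrm{epi}(I_{B_j})}(x) \right|, \qquad (27)$$

and $d_S(x) = \min_{y \in S} \|x - y\|$ for a subset S of $\mathbb{R}^{p+1}$.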

Proof. Let us note that the distance d characterizes the epi-convergence of lower semi-continuous
functions: a sequence $\{f_n\}$ of extended-real-valued lower semi-continuous functions from $\mathbb{R}^{p+1}$
epi-converges to an extended-real-valued lower semi-continuous function f if and only if $d(f_n, f) \to 0$ as
n goes to infinity. We recall that $B_j = \{(u_{1:p}, u_{p+2}) \in \mathbb{R}^{p+1},\ u_j = 0\}$ and for a set A, $I_A$ denotes
the indicator function of A. Let us introduce the set $D_j = \{(u_{1:p}, u_{p+2}) \in \mathbb{R}^{p+1},\ u_j = 0 \text{ and } u_{p+2} \ge
-\sqrt{\lambda_n}\tau^*\}$. By using the triangular inequality,

$$d(q_{n,j}, I_{B_j}) \le d(q_{n,j}, I_{D_j}) + d(I_{D_j}, I_{B_j}). \qquad (28)$$

To begin with, let us show that

$$d(I_{D_j}, I_{B_j}) \le 2^{-\lfloor \sqrt{\lambda_n}\tau^* \rfloor}. \qquad (29)$$

Here we use a geometrical point of view. The epigraph of the indicator function $I_A$ of a set A
is the "half-cylinder with cross-section A", i.e. $A \times \mathbb{R}_+$. Consequently, the epigraph of $I_{B_j}$ is a
half-hyperplane supported by the $u_j$ axis and the epigraph of $I_{D_j}$ is the part of this half-hyperplane
where, moreover, $u_{p+2} \ge -\sqrt{\lambda_n}\tau^*$. Note that this cut is perpendicular to the $u_{p+2}$-axis. So if we
consider $x \in \mathbb{R}^{p+2}$ such that $x_{p+1} > -\sqrt{\lambda_n}\tau^*$, the distance between x and $\mathrm{epi}(I_{D_j})$ is reached for a
point in $\mathrm{epi}(I_{B_j})$. Thus

$$\forall x,\ \|x\|_2 \le k \text{ with } k < \sqrt{\lambda_n}\tau^*, \quad d_{\mathrm{epi}(I_{D_j})}(x) = d_{\mathrm{epi}(I_{B_j})}(x), \qquad (30)$$

and if $k < \sqrt{\lambda_n}\tau^*$ then $d_k(\mathrm{epi}(I_{D_j}), \mathrm{epi}(I_{B_j})) = 0$. Now the definition (26) of the distance d implies
that

$$d(I_{D_j}, I_{B_j}) = \sum_{k \ge \lfloor \sqrt{\lambda_n}\tau^* \rfloor + 1} \frac{1 \wedge d_k(\mathrm{epi}(I_{D_j}), \mathrm{epi}(I_{B_j}))}{2^k} \le \sum_{k \ge \lfloor \sqrt{\lambda_n}\tau^* \rfloor + 1} \frac{1}{2^k},$$

and (29) is proved.
Next, we show that 



^,/ Pj )< 2 ^ 1 +2-1-^1. (31) 
For $j \in [p_0+1, p]$, $q(0) = 0$ implies that



= fes + X') J ■ , I + / E (32) 



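where we set 0/0 = 0 and

$$E = \{(u_{1:p}, u_{p+2}),\ u_{p+2} > -\sqrt{\lambda_n}\tau^*\} \cup \{(u_{1:p}, u_{p+2}),\ u_{p+2} = -\sqrt{\lambda_n}\tau^* \text{ and } u_j = 0\}.$$

Consequently, $q_{n,j}(u_{1:p}, u_{p+2}) \le I_{D_j}(u_{1:p}, u_{p+2})$. Indeed, it is clear if $(u_{1:p}, u_{p+2}) \notin D_j$. Moreover,
if $(u_{1:p}, u_{p+2}) \in D_j$, $q_{n,j}(u_{1:p}, u_{p+2}) = 0$ since $q(0) = 0$. Consequently, $\mathrm{epi}(I_{D_j}) \subset \mathrm{epi}(q_{n,j})$,
$d_{\mathrm{epi}(I_{D_j})}(\cdot) \ge d_{\mathrm{epi}(q_{n,j})}(\cdot)$ and

$$d_k(\mathrm{epi}(q_{n,j}), \mathrm{epi}(I_{D_j})) = \max_{\|x\| \le k} \left( d_{\mathrm{epi}(I_{D_j})}(x) - d_{\mathrm{epi}(q_{n,j})}(x) \right).$$

Since $\forall t \in \mathbb{R}$, $q(t) \ge |t|$, it holds that, $\forall (t, \tau) \in \mathbb{R} \times \mathbb{R}_+^*$, $\tau q(t/\tau) \ge |t|$ and expression (32) entails

$$q_{n,j}(u_{1:p}, u_{p+2}) \ge F_{n,j}(u_{1:p}, u_{p+2}),$$

where $F_{n,j}(u_{1:p}, u_{p+2}) = \lambda_n |u_j|\,|\hat\beta_j^{unpen}|^{-\gamma}/\sqrt{n} + I_E$. Consequently, $\mathrm{epi}(q_{n,j}) \subset \mathrm{epi}(F_{n,j})$, $d_{\mathrm{epi}(q_{n,j})}(\cdot) \ge
d_{\mathrm{epi}(F_{n,j})}(\cdot)$ and

$$d_k(\mathrm{epi}(q_{n,j}), \mathrm{epi}(I_{D_j})) \le \max_{\|x\| \le k} \left( d_{\mathrm{epi}(I_{D_j})}(x) - d_{\mathrm{epi}(F_{n,j})}(x) \right). \qquad (33)$$

Now, $\mathrm{epi}(F_{n,j}) = S_1 \cup S_2$ where

$$S_1 = \left\{(u_{1:p}, u_{p+2}, t) \in \mathbb{R}^{p+2},\ u_{p+2} > -\sqrt{\lambda_n}\tau^* \text{ and } \frac{\lambda_n |u_j|\,|\hat\beta_j^{unpen}|^{-\gamma}}{\sqrt{n}} \le t\right\},$$

and

$$S_2 = \{(u_{1:p}, u_{p+2}, t) \in \mathbb{R}^{p+2},\ u_{p+2} = -\sqrt{\lambda_n}\tau^*,\ u_j = 0, \text{ and } t \ge 0\}.$$

Thus,

$$d_{\mathrm{epi}(F_{n,j})}(x) = d_{S_1}(x) \wedge d_{S_2}(x). \qquad (34)$$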
Easy calculations lead to, Va; G 1R P+2 , 



p+2 
1=1 



(35) 



and 

p+2 



P+2 

d\{x) = inf ^(xi - Zi) 2 = dl pi{fn j) (xi, ■ ■ ■ ,x p ,x p+2 ) + (x p+1 + ^ n T*) 2 \ Xv+i< _^ T ^ 
where $f_{n,j}(u_{1:p}) = \lambda_n |u_j|\,|\hat\beta_j^{unpen}|^{-\gamma}/\sqrt{n}$. If we consider $x \in \mathbb{R}^{p+2}$ such that $\|x\|_2 \le k$ with $k <
\sqrt{\lambda_n}\tau^*$, it satisfies $x_{p+1} > -\sqrt{\lambda_n}\tau^*$ and thus $d_{S_1}^2(x) = d_{\mathrm{epi}(f_{n,j})}^2(x_1, \ldots, x_p, x_{p+2})$. Technical
computations lead to



d Sl (x) 



rp 2 I ,-y» 2 



if Xp +2 < 



> in 



if \ Si < X n +2 < 



1+— ^ 







if x p+2 > 



An 



v^l/3; npen |7 



(36) 



Using explicit expressions (36) and (35), we can show that for any x G IR P+2 such that ||x||2 < k 

(37) 



with k < vA^r*, 



4i (x) < 4, (x 



s 2 \ 



Gathering (37) with (34), for any x G IR P+2 such that ||x||2 < k with k < y/X^T*, 

depi(F nJ ){x) = d Sl (x) = d cpi (j n j )(xi, ■ ■ ■ ,x p ,x p+2 ). 



(38) 



Combining (33), (38) and (30), if k < V\t*, we obtain 



4(epi (q n>j ) , epi (l D ,)) < max (d epi{lB , ) {x 1 , ■■■ ,x p , x p+2 ) - d 

||cc||<fc V J 



The involved objective function does not depend on x p+ \. Moreover, using the form of the con- 
straints, if k < y/X^T*, we get 

d k (epi (q n:j ) ,epi (l D )) < max [d epi (i A .)(x u ■ ■ ■ } x p} x p+2 ) - d epi ^ fnj) (x 1 , ■ ■ ■ ,x p ,x p+2 ) ) ■ 

xf-\ hx2+x2 +2 <fc 2 V J / 

Moreover, since Vui :p G W, lAj{v>\-.p) > fn,j(ui-.p), if k < a/A^t*, 

4(epi (q n ,j) , epi (l Dj )) < d k (epi (f nJ ) , epi (Ja,)), 
and technical computations leads to 



4(epi (f n ,j) , epi (Xt,)) 



AnA/l + 



Finally, using the definition (26), we have 



w r n . \- 4(epi(g„j),epi(J D3 )) x ^ I 

d{q nij ,l D .) < ^ 2 k ^ 9* 



2 A ' 



fc>[VVr*J+l 
k 



Gathering this inequality with the previous one and the fact that $\sum_{k \ge 1} k/2^k = 2$, (31) is proved.
Using equation (28) with (29) and (31), the bound involved in Lemma [3] holds. ■
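
The bound on the distance between the two indicator functions rests on a geometric tail sum: if, as the proof indicates, the terms $d_k$ vanish for small k and each remaining term is capped by $1/2^k$, then the series from $k = K + 1$ onwards is at most $2^{-K}$. The sketch below checks this arithmetic only; the function names are hypothetical, and $K = \lfloor \sqrt{\lambda_n}\tau^* \rfloor$ is the cut-off read from the proof.

```python
import math

def tail_sum(k_min, n_terms=200):
    """Truncated value of the series sum_{k >= k_min} 2^{-k}."""
    return sum(2.0 ** (-k) for k in range(k_min, k_min + n_terms))

def indicator_distance_bound(lambda_n, tau_star):
    """Value 2^{-floor(sqrt(lambda_n) * tau_star)} of the tail bound."""
    return 2.0 ** (-math.floor(math.sqrt(lambda_n) * tau_star))
```

Since the bound decays geometrically in $\sqrt{\lambda_n}\tau^*$, it goes to 0 as soon as $\lambda_n \to +\infty$ with $\tau^* > 0$.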






Acknowledgements 



Part of this work was supported by the Interuniversity Attraction Pole (IAP) research network in
Statistics P5/24 and by the MSTIC project of the Joseph-Fourier University. We are grateful to Anestis
Antoniadis for constructive and fruitful discussions.

References 

[1] A. Antoniadis and J. Fan. Regularization of Wavelet Approximations. Journal of the American
Statistical Association, 96:939-967, 2001.

[2] M. A. Arcones. Weak convergence of convex stochastic processes. Stat. Probab. Lett., 37(2):171-182,
1998.

[3] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and
supervised clustering of predictors with OSCAR. Biometrics, 64(1):115-123, 2008.

[4] M. El Anbari and A. Mkhadri. Penalized regression combining the L1 norm and a correlation
based penalty. Research Report RR-6746, INRIA, 2008.

[5] J. Fan and R. Li. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle
Properties. Journal of the American Statistical Association, 96:1348-1360, 2001.

[6] S. Ghosh. Adaptive elastic net: An improvement of elastic net to achieve oracle properties.
Tech. rep., Department of Mathematical Sciences, Indiana University-Purdue University,
Indianapolis, 2007.

[7] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page
and software). http://stanford.edu/~boyd/cvx, June 2009.

[8] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel,
S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control (a tribute to
M. Vidyasagar), pages 95-110. Lecture Notes in Control and Information Sciences, Springer, 2008.

[9] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer,
corrected edition, July 2003.

[10] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms I.
Grundlehren der Mathematischen Wissenschaften 306. Berlin: Springer-Verlag, 1991.

[11] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12:55-67, 1970.

[12] P. Huber. Robust Statistics. Wiley, New York, 1981.

[13] K. Knight. Epi-convergence in distribution and stochastic equi-semicontinuity. In Corpus-based
work, pages 33-50, 1997.






[14] K. Knight, and W. Fu. Asymptotics for Lasso-type estimators In Ann. Stat., pages 1356-1378, 
2000. 

[15] S. Lambert-Lacroix and L. Zwald. Robust regression through the Huber's criterion and adaptive 
lasso penalty. Electronic Journal of Statistics, 5:1015-1053, 2011. 

[16] C. Leng, Y. Lin, and G. Wahba. A note on the Lasso and related procedures in model selection. 
Stat. Sin., 16(4): 1273-1284, 2006. 

[17] L. McLinden and R. C. Bergstrom. Preservation of convergence of convex sets and functions 
in finite dimensions. Trans. Am. Math. Soc, 268:127-142, 1981. 

[18] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the 
Lasso. Ann. Stat, 34(3): 1436-1462, 2006. 

[19] A. B. Owen. A robust hybrid of lasso and ridge regression. Technical report, 2006. 

[20] R. Rockafellar. Convex analysis. Princeton Landmarks in Mathematics. Princeton, NJ: 
Princeton University Press, 1970. 

[21] S. Sardy, P. Tseng, and A. Bruce. Robust wavelet denoising. IEEE Transactions on Signal 
Processing, 49(6):1146-1152, 2001. 

[22] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461-464, 1978. 

[23] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. 
1999. 

[24] T. A. Stamey, J. N. Kabalin, J. E. McNeal, I. M. Johnstone, F. Freiha, E. A. Redwine, 
and N. Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of 
the prostate: II. Radical prostatectomy treated patients. Journal of Urology, 141(5):1076-1083, 
1989. 

[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical 
Society, Series B, 58:267-288, 1996. 

[26] A. Van der Vaart. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic 
Mathematics, 3. Cambridge, 1998. 

[27] A. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. With 
applications to statistics. Springer Series in Statistics. New York, NY: Springer, 1996. 

[28] H. Wang and C. Leng. Unified Lasso Estimation via Least Squares Approximation. JASA, 
102:1039-1048, 2007. 

[29] H. Wang, G. Li, and G. Jiang. Robust regression shrinkage and consistent variable selection 
through the LAD-Lasso. Journal of Business & Economic Statistics, 25(3):347-355, 2007. 






[30] H. Wang, R. Li, and C. Tsai. Tuning parameter selectors for the smoothly clipped absolute 
deviation method. Biometrika, 94(3):553-568, 2007. 



[31] M. Yuan and Y. Lin. Model selection and estimation in regression with 
grouped variables. Journal of the Royal Statistical Society, Series B, 68:49-67, 2006. 

[32] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped 
and hierarchical variable selection. arXiv:0909.0411 (IMS-AOS-AOS584), Sep 2009. 
Published at http://dx.doi.org/10.1214/07-AOS584 in the Annals of Statistics 
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics 
(http://www.imstat.org). 

[33] P. Zhao and B. Yu. On Model Selection Consistency of Lasso. Technical report, University of 
California, Berkeley. Dept. of Statistics, 2006. 

[34] H. Zou. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical 
Association, 101(476):1418-1429, 2006. 

[35] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the 
Royal Statistical Society B, 67(2):301-320, 2005. 

[36] H. Zou and H. H. Zhang. On the adaptive elastic net with a diverging number of parameters. 
Ann. Stat., 37(4):1733-1751, 2009. 






Tables and Figures 



Table 1: Model selection ability on model 1, based on 100 replications. 

Table 2: Model selection ability on model 2, based on 100 replications. 

Table 3: Model selection ability on model 3, based on 100 replications. 

[Boxplot panels: RPE (model 1, n=100); RPE (model 2, n=100).] 

Figure 1: RPE for n = 100 for ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso 
(hl), Huber-ridge (hr), Huber-ad-en (he), and Huber-ad-Berhu (hb). The boxplots omit the 
following extreme values: model 1, hl: 2.87; model 2, b: 2.95, hl: 2.94, he: 794.15; model 3, 
hl: 2.58. 

[Boxplot panels: RPE (model 1, n=200); RPE (model 2, n=200); RPE (model 3, n=200).] 

Figure 2: RPE for n = 200 for ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso 
(hl), Huber-ridge (hr), Huber-ad-en (he), and Huber-ad-Berhu (hb). The boxplots omit the 
following extreme values: model 1, b: 2.95, hl: 2.48, 2.95, 12.79, 2.86, 2.54, 2.96, 2.95; 
model 3, b: 2.95, 2.95, 2.95, hl: 2.95, 2.94, 2.95, 2.51, 49.03. 

[Boxplot panel: RPE (model 1, n=400).] 

Figure 3: RPE for n = 400 for ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso 
(hl), Huber-ridge (hr), Huber-ad-en (he), and Huber-ad-Berhu (hb). The boxplots omit the 
following extreme values: model 1, b: 2.95, 2.95, hl: 2.95, 2.49, 2.90, 2.95, 2.94, 2.95, 
2.95, 2.93; model 2, b: 2.95, 2.95, 0.99; model 3, b: 2.95, 2.95, hl: 8.97, 2.95. 

[Boxplot panels: estimation of the first coefficient, model 1, for n=100, n=200, and n=400.] 

Figure 4: Model 1: estimates of the first influencing coefficient (true value equal to 3) by 
ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso (hl), Huber-ridge (hr), 
Huber-ad-en (he), and Huber-ad-Berhu (hb). 

[Boxplot panel: estimation of the first coefficient, model 2, n=100.] 

Figure 5: Model 2: estimates of the first influencing coefficient (true value equal to 3) by 
ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso (hl), Huber-ridge (hr), 
Huber-ad-en (he), and Huber-ad-Berhu (hb). 

[Boxplot panels: estimation of the first coefficient, model 3, for n=100, n=200, and n=400.] 

Figure 6: Model 3: estimates of the first influencing coefficient (true value equal to 3) by 
ad-lasso (l), ridge (r), ad-en (e), ad-Berhu (b), Huber-ad-lasso (hl), Huber-ridge (hr), 
Huber-ad-en (he), and Huber-ad-Berhu (hb). 

Table 4: Prostate cancer data: comparing methods. 

Methods                   Mean of 100 parameters (std of the 100)              Mean of 100 RPE (std of the 100) 

OLS                       none                                                 0.6054(0.1397) 

Least squares criterion 
  ad-lasso                λ_n: 2.4177(1.7368)                                  0.6357(0.1410) 
  ridge                   λ_n: 2.6104(2.3111)                                  0.6145(0.1406) 
  ad-en                   λ_{1,n}: 1.1361(1.0048), λ_{2,n}: 2.5032(10.2605)    0.6231(0.1351) 
  ad-Berhu                λ_n: 1.9850(1.2782)                                  0.6237(0.1423) 

Huber's criterion 
  ad-lasso                λ_n: 26.2749(7.4369)                                 0.7765(0.1879) 
  ridge                   λ_n: 3.7437(3.5792)                                  0.6020(0.1327) 
  ad-en                   λ_{1,n}: 1.3885(1.5778), λ_{2,n}: 4.3222(14.2073)    0.6185(0.1295) 
  ad-Berhu                λ_n: 2.7456(1.9015)                                  0.6322(0.1391) 

Figure 7: Prostate cancer data: histogram of the number of times each variable is selected in 
the re-sampling study. 